2  Introduction

from google.colab import drive
drive.mount('/content/drive')

2.1 [User]

I want to learn python and use it for molecular biology research such as design of library sequence, design of experiments, read high-throughput sequencing file, apply deep learning algorithm. What topics do i need to learn?

2.2 [ChatGPT]

Learning Python for molecular biology research involves a blend of general programming skills and domain-specific knowledge. Here’s a structured guide on the topics you need to cover:

2.2.1 Table of Contents

2.2.1.1 Basic Python Programming

  • Python Syntax and Semantics: Learn basic syntax, data types (strings, lists, dictionaries), control structures (if statements, loops), functions, and error handling.
  • Modules and Packages: Understand how to import and use standard libraries and third-party packages.
  • File I/O: Learn how to read from and write to files.

2.2.1.2 Data Handling and Analysis

  • NumPy: For numerical operations and handling arrays.
  • Pandas: For data manipulation and analysis, essential for handling tabular data, such as gene expression matrices.
  • Matplotlib/Seaborn: For data visualization to generate plots and graphs.

2.2.1.3 Bioinformatics Libraries

  • Biopython: Essential for biological computation. Learn how to read/write sequence files, perform sequence analysis, and interact with biological databases.
  • Scikit-bio: For biological analysis including sequence alignment, biodiversity analysis, etc.

2.2.1.4 Sequence Analysis

  • FASTA/FASTQ Parsing: Learn how to parse sequence files.
  • Sequence Alignment: Understand the basics of sequence alignment (global and local), and tools like BLAST.
  • Motif Finding: Learn how to identify sequence motifs.

2.2.1.5 High-Throughput Sequencing Data

  • HTSeq: For reading and analyzing high-throughput sequencing data.
  • PySAM: For reading and manipulating SAM/BAM files (aligned sequencing reads).
  • SeqIO: Part of Biopython for input/output of sequence data.

2.2.1.6 Experimental Design

  • Primer Design: Understand the principles of primer design and tools available (like Primer3) and their Python interfaces.
  • Golden Gate Assembly and Gibson Assembly: Learn the principles and how to design experiments programmatically.

2.2.1.7 Statistical Analysis

  • SciPy: For scientific and technical computing.
  • Statsmodels: For statistical modeling.
  • R Integration: Learn how to use R from Python (via rpy2) if you need to use specific R packages for statistical analysis.

2.2.1.8 Machine Learning and Deep Learning

  • Scikit-learn: For machine learning algorithms.
  • TensorFlow/Keras or PyTorch: For deep learning.
  • Bioinformatics-specific models: Learn about models specific to bioinformatics, such as predicting protein structures or gene expression levels.

2.2.1.9 Workflow Automation

  • Snakemake: For workflow management.
  • Nextflow: Another workflow management tool, useful for reproducible research.

2.2.1.10 Advanced Topics

  • Data Integration: Integrating different types of biological data (genomic, transcriptomic, proteomic).
  • Network Biology: Understanding biological networks (e.g., protein-protein interaction networks).
  • Structural Bioinformatics: Analyzing protein structures using tools like PyMOL.

2.2.1.11 Learning Resources

  1. Books
    • “Python for Biologists” by Martin Jones
    • “Bioinformatics Programming Using Python” by Mitchell L. Model
  2. Online Courses
    • Coursera: “Python for Genomic Data Science” by Johns Hopkins University
    • edX: “Python for Data Science” by Microsoft
    • Biostars: Online forum for bioinformatics questions and tutorials.
  3. Tutorials and Documentation
    • Biopython Tutorial and Cookbook: Comprehensive guide to using Biopython.
    • NumPy and Pandas documentation: Essential for data manipulation.
    • Scikit-learn and TensorFlow/Keras documentation: For machine learning and deep learning.

By following this structured guide, you’ll be able to leverage Python effectively for molecular biology research.

2.3 [User]

I’ll learn it using google colab. can you let me know how to setup the colab environment for this learning

2.4 [ChatGPT]

Certainly! Google Colab is an excellent platform for learning and running Python code, especially for bioinformatics. It provides a cloud-based Jupyter notebook environment that comes pre-configured with many scientific libraries. Here’s how to set up and use Google Colab for learning bioinformatics with Python:

2.4.1 Step-by-Step Guide to Setting Up Google Colab

2.4.1.1 Access Google Colab

  • Go to Google Colab in your web browser.
  • Sign in with your Google account.

2.4.1.2 Create a New Notebook

  • Click on File -> New Notebook to create a new notebook.

2.4.1.3 Install Required Libraries

Google Colab comes with many libraries pre-installed, but you might need to install some additional ones, such as biopython and scikit-bio. You can do this using the !pip install command directly in a cell.

!pip install biopython scikit-bio

2.4.1.4 Import Libraries and Verify Installation

In a new code cell, import the libraries to ensure they are installed correctly.

# Importing necessary libraries
import Bio
import skbio

print("Biopython version:", Bio.__version__)
print("scikit-bio version:", skbio.__version__)
Biopython version: 1.83
scikit-bio version: 0.6.0

2.4.1.5 Upload Files to Colab

  • 구글 드라이브에 2024-cnu-lecture 폴더를 만들기

  • 다운로드 하거나 공유된 (ipynb) 파일을 2024-cnu-lecture 폴더에 복사후 colab으로 open

  • 분석을 위한 ganbank 파일 ncbi에서 다운로드

  • 구글 드라이브의 위 디렉토리에 genbank 파일 복사 또는 다음 코드로 로컬 파일을 업로드 가능

  • 현재 작업 디렉토리 확인

!pwd
  • 현재 작업 디렉토리를 위 생성한 디렉토리로 변경
import os
os.chdir('drive/MyDrive/2024-cnu-lecture')
!pwd
  • 분석을 위한 genbank 등의 파일을 ncbi에서 다운로드 후 위 폴더에 복사
  • 또는 아래 코드를 이용해서 현재 작업 디렉토리에 업로드
from google.colab import files

uploaded = files.upload()

# Listing the uploaded files
for filename in uploaded.keys():
    print(filename)

image.png

2.4.1.6 Example: Reading a GenBank File and Analyzing Sequences

  • Here’s an example workflow that reads a GenBank file, performs sequence analysis with Biopython, and aligns sequences with scikit-bio:
  • 위에서 다운로드 받은 genbank 파일을 egfp.gb 이름으로 변경 후 업로드
  • 아래 예시 코드는 단백질 서열을 읽고 비교하는 코드임
from Bio import SeqIO
from Bio.Seq import Seq
from skbio import Protein
from skbio.alignment import global_pairwise_align_protein

# Assuming the uploaded GenBank file is named 'example.gb'
genbank_file = 'data/egfp.gb'

# Read a protein sequence from a GenBank file
record = SeqIO.read(genbank_file, "genbank")
print(len(record.seq))

protein_seq = record.seq.translate()
print(protein_seq)

# Perform alignment with another protein sequence
seq1 = Protein(str(protein_seq))
seq2 = Protein("MKVLYNLKDG")

alignment, score, start_end_positions = global_pairwise_align_protein(seq1, seq2)

print("Alignment:\n", alignment)
print("Score:", score)
print("Start-End Positions:", start_end_positions)
1150
MSRVSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQEVDLCILLS*EVELFPFCVLQLQMEIHRDTRGFCLVLSFVLVVN*KFIHLQEINSLLFKSKGLTYVSVKHWIPKNSKMR*GIFNLVWTSNRLG*PPVKLQQDCLLTQCLRSDVQ*LVLFHFALDFLFANELCF*WFSLFWCVENHLLQGRRQLQDPRRGEVRGRHPGEPHRAEGHRLQGGRQHPGAQAGVQLQQPQRLYHGRQAEERHQGELQDPPQHRGRQRAARRPLPAEHPHRRRPRAAARQPLPEHPVRPEQRPQREARSHGPAGVRDRRRDHSRHGRAVQV
Alignment:
 TabularMSA[Protein]
-----------------------------------------------------------------------
Stats:
    sequence count: 2
    position count: 384
-----------------------------------------------------------------------
MSRVSKGEELFTGVVPILVEL-DGDVNGHKFSV ... PEQRPQREARSHGPAGVRDRRRDHSRHGRAVQV
--------------MKVLYNLKDG--------- ... ---------------------------------
Score: 18.0
Start-End Positions: [(0, 382), (0, 9)]
/home/haseong/anaconda3/envs/biopy/lib/python3.11/site-packages/Bio/Seq.py:2880: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  warnings.warn(
/home/haseong/anaconda3/envs/biopy/lib/python3.11/site-packages/skbio/alignment/_pairwise.py:538: EfficiencyWarning: You're using skbio's python implementation of Needleman-Wunsch alignment. This is known to be very slow (e.g., thousands of times slower than a native C implementation). We'll be adding a faster version soon (see https://github.com/scikit-bio/scikit-bio/issues/254 to track progress on this).
  warn(

2.4.1.7 Saving Your Work

You can save your notebook to Google Drive:

  • Click on File -> Save a copy in Drive.

2.4.1.8 Sharing Your Notebook

You can share your notebook with others by clicking on the Share button in the top right corner and setting the appropriate sharing permissions.

2.4.1.9 Example Colab Notebook

Here’s a more comprehensive example notebook that includes reading files, sequence analysis, and visualization: (다음 코드 참고)

This notebook covers: - Installing and importing bioinformatics libraries. - Uploading and reading files. - Performing sequence analysis. - Visualizing data.

By following these steps and examples, you’ll be well-prepared to use Google Colab for your bioinformatics learning and research.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Example data for visualization
data = {
    'Sequence': ['Seq1', 'Seq2', 'Seq3'],
    'GC_Content': [50, 55, 45],
    'Length': [20, 22, 18]
}

# Visualize using Seaborn
df = pd.DataFrame(data)
sns.histplot(df['GC_Content'], kde=True)
plt.title('GC Content Distribution')
plt.xlabel('GC Content (%)')
plt.show()

sns.boxplot(x=df['Length'])
plt.title('Sequence Length Distribution')
plt.xlabel('Length (bp)')
plt.show()