Author

Andrzej Tkacz atkacz@ualg.pt, David Paleček dpalecek@ualg.pt

Published

16/04/2025

Doi

Simple Linux – An Introduction to Genomic Analysis on Linux

Large files or datasets—especially those containing genomic data—no problem. This workshop introduces essential Linux commands and simple Bash scripts to streamline data manipulation tasks. We’ll cover key operations such as searching for patterns, globally modifying content, and aligning DNA sequences, such as FASTAQ and FASTA files.

Setup

No extra setup needed for Linux and MacOS machines. For windows the recommended way is to install VBox. See the steps to follow:

Here are the instructions: Installing Ubuntu on VirtualBox

Step 1: Install VirtualBox

Run the Installer:
After downloading the VirtualBox installer (.exe file for WinOS, .dmg for MacOS), double-click on it to launch the setup wizard and follow default installation instructions.
Launch VirtualBox.

Step 2: Create a New Virtual Machine for Ubuntu

Open VirtualBox and Create a New VM: Click on the New button.
Name and Operating System: Give your VM a name (e.g., “Ubuntu”), select “Linux” as the type, and choose Ubuntu (64-bit) as the version.
Assign Memory: Allocate at least 2048 MB (2 GB) of RAM (or higher if your system permits) to ensure smooth operation.
Create a Virtual Hard Disk:
- Choose Create a virtual hard disk now.
- Select the VDI (VirtualBox Disk Image) format, and set the disk size (a minimum of 20 GB is recommended).

Step 3: Configure and Install Ubuntu

Attach the Ubuntu ISO after downloading:
- Select your new VM, click on Settings, then navigate to the Storage section.
- Under the “Controller: IDE” (or SATA), click on the empty optical drive.
- Click the small disk icon on the right and choose Choose a disk file. Browse to and select the Ubuntu ISO you downloaded.
Start the Virtual Machine:
- With the ISO attached, click Start.
- Note
  
  You might need to restart your machine
- The VM will boot from the ISO, and you should see the Ubuntu welcome screen.
Install Ubuntu:
- Follow the on-screen instructions in the Ubuntu installer:
  - Choose your language and keyboard layout.
  - Select Install Ubuntu.
  - Follow prompts to set up your timezone, create a user account, and configure your disk (use the default settings for a virtual disk).
  - Once the installation is complete, you’ll be prompted to restart the VM.

Command line

Why

Eventually you will need bigger compute than you can get on your machine and you will be forced to work on Cloud such as High Power cluster (HPC) which allow you to parallelize and run heavy computations.
Things can be done faster on a command line with practice.

Basics

Open the terminal and let’s create a file called maria_joanna_song

touch maria_joanna_song

let’s just copy in this text - open the file with double click and paste this song into the maria_joanna_song file, also found here

E virou!
Onde anda essa Maria?
Eu só tenho essa Maria
Eu vim do norte direto a Lisboa
Atrás de um sonho
Que eu nem sei se voa
Tanto quanto nós voávamos
Debaixo dos lencóis
A francesinha já não tem picante
E isso faz-me lembrar os instantes
Em que tu mordias os meus lábios
Depois do amor
Não sei se vou aguentar
O tempo que vou cá ficar
Porque não estás
Porque não estás aqui
As lágrimas vão secar
Mas a saudade vai continuar
No meu peito a chamar
Maria Joana, Maria Joana
Apanha o primeiro autocarro
Vem ficar pra sempre do meu lado
Maria Joana, Maria Joana
Apanha o primeiro autocarro
Vem ficar pra sempre do meu lado
Minha mãe nunca me viu partir
Mas todo o filho um dia voa
Saudade, saudade
Mamã cuida dela por mim
Diz-me porque não estás aqui
Maria, sem ti fico à toa
Saudade, saudade
Maria, eu quero-te ver
Não sei se vou aguentar
O tempo que vou cá ficar
Porque a saudade aperta o meu peito
E dói demais (dói demais)
As lágrimas vão secar
Mas a saudade vai continuar
No meu peito a chamar
Maria Joana, Maria Joana
Apanha o primeiro autocarro
Vem ficar pra sempre do meu lado
Maria Joana (Maria), Maria Joana
Apanha o primeiro autocarro
Vem ficar pra sempre do meu lado
Maria, Maria, Maria
Maria, Maria, Maria
(Maria, Maria, Maria, Maria)
Maria, Maria, Maria
Maria, Maria, Maria
(Maria, Maria, Maria, Maria)
Quantas lágrimas chorei
Quantas noites não dormi
Cada gota que eu deitei
Fez um rio que me leva a ti
Maria Joana
Maria Joana (quantas lágrimas chorei)
Apanha o primeiro autocarro
Vem ficar pra sempre
Do meu lado (quantas noites não dormi)
Maria Joana (ai, Maria)
Maria Joana (cada gota que eu deitei)
Apanha o primeiro autocarro
Vem ficar pra sempre do meu lado
Maria, Maria, Maria
Maria, Maria, Maria
(Maria, Maria, Maria, Maria)
Maria, Maria, Maria
Maria, Maria, Maria
(Maria, Maria, Maria, Maria)
Quantas lágrimas chorei

To see the full lyrics (TIP: press tab to autocomplete the file name)

cat maria_joanna_song

Show the first 5 lines

head -5 maria_joanna_song

To see only the last 5 lines

tail -5 maria_joanna_song

To see only the last 10 to 15 lines (note the “|” sign it is a pipe and let’s you combine the commands into one, without dumping the information onto your harddrive)

head -15 maria_joanna_song | tail -5

If you want to put information into your disk you can do as follows

head -15 maria_joanna_song > top15_maria
tail -5 top15_maria > lines10-15_maria

To see the output we need to either click on the file or use the terminal

cat lines10-15_maria

Searching text

How to find something in the text?

grep Maria maria_joanna_song

And how to see only the matched pattern - commands have “options”

grep -o Maria maria_joanna_song

You can always get help ( command --help or shortcut version -h) - generally not much usefull - better to google it or ask ChatGPT

grep --help

Let’s count word Maria - three different ways to count something

grep Maria maria_joanna_song | wc -l
grep Maria maria_joanna_song | wc -w
grep Maria maria_joanna_song | wc -c

Now count appearances of Maria Joanna - it is important to use “” if the your search contains a space or some special character

grep "Maria Joana" maria_joanna_song | wc -l
grep "Maria Joana" maria_joanna_song | wc -w
grep "Maria Joana" maria_joanna_song | wc -c

How to deal with special characters

grep ( maria_joanna_song   #note the error
grep "(" maria_joanna_song
grep \( maria_joanna_song

Wild cards

“.” and “*” can represent any characters and therefore represent patterns rather than exact matches

grep Eu maria_joanna_song
grep Eu.* maria_joanna_song

Playing with multiple choices

grep se maria_joanna_song 
grep se[i] maria_joanna_song 
grep se[i,m] maria_joanna_song

But I am in love with Donald Trump and not with Maria Joana - how to I fix the song?

sed 's/Maria Joana/Donald Trump/g'  maria_joanna_song | sed 's/Maria/Donald/g' > donald_trump_song

Let’s see what happened - now it doesn’t rhyme as good

cat donald_trump_song
more donald_trump_song

Ok, but are we sure Donald replaced Maria?

diff -u *song 
diff maria_joanna_song donald_trump_song

vimdiff *song

Note

exit vim using Esc + :q which closes the file, alternatively Esc + :wq which saves the file.

Working with multiple files

Have a look how a loop is being used here

grep Lisboa *song
for f in *song ; do grep Lisboa $f ; done

To show which file Lisboa is found in

for f in *song ; do grep -l Lisboa $f ; done

To show where is it found (i.e. line number)

for f in *song ; do grep -n Lisboa $f ; done

I have just realised that “joanna” in portuguese is written as “joana” - how do I fix the name of my file

option - fix it but keep the original

cp maria_joanna_song maria_joana_song

# list contents of directory
ls

# other variants
ls -l
ls -la

option - neh, I don’t need the previous version - it will just cause the confusion later on

mv maria_joanna_song maria_joana_song

Now let’s try previous code

grep Maria maria_joanna_song  # of course it doesn't work ... - what shall I do now ?

Genomics

Word on E-Utilities

Taken from official website:

EDirect will run on Unix and Macintosh computers, and under the Cygwin Unix-emulation environment on Windows PCs. To install the EDirect software, open a terminal window and execute one of the following two commands:

  sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"

  sh -c "$(wget -q https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"

This will download a number of scripts and several precompiled programs into an “edirect” folder in the user’s home directory. It may then print an additional command for updating the PATH environment variable in the user’s configuration file. The editing instructions will look something like:

  echo "export PATH=\$HOME/edirect:\$PATH" >> $HOME/.bash_profile

As a convenience, the installation process ends by offering to run the PATH update command for you. Answer “y” and press the Return key if you want it run. If the PATH is already set correctly, or if you prefer to make any editing changes manually, just press Return.

Once installation is complete, run:

  export PATH=${HOME}/edirect:${PATH}

to set the PATH for the current terminal session.

Windows

You can try to install SRA-toolkit from here

ESearch

Let’s download some data from the Genbank - it is 16S V4 amplicon data (here the metadata)

esearch -db sra -query PRJNA783372 | efetch -format runinfo > SraRunTable.csv

We only need the SRRnumber, -f1 means first field.

cut -f1 -d, SraRunTable*.csv  > runs.txt

To save time we will work on four samples only

tail -4 runs.txt > runs4.txt

Get the data

Ok, now we download the actual raw data

prefetch --option-file runs4.txt

In case of

Note

In case of Command 'prefetch' not found run

sudo apt install sra-toolkit

Let’s unpack it

for f in SRR* ; do fastq-dump --split-files  $f ; done

Tip

To visualize quality scores, small GUI tool fastqc could be used

Raw Reads Manipulation

sudo apt install flash

First merge the reads

# merge
flash -m 50 -M 500 -x 0.1 SRR17045222_1.fastq SRR17045222_2.fastq

# rename merged file to index it
mv out.extendedFrags.fastq file1.fastq

flash -m 50 -M 500 -x 0.1 SRR17045223_1.fastq SRR17045223_2.fastq
mv out.extendedFrags.fastq file2.fastq

flash -m 50 -M 500 -x 0.1 SRR17045226_1.fastq SRR17045226_2.fastq
mv out.extendedFrags.fastq file2.fastq

flash -m 50 -M 500 -x 0.1 SRR17045227_2.fastq SRR17045227_2.fastq
mv out.extendedFrags.fastq file2.fastq

Now we filter low quality reads (here the data is kind of a fake as the fastq download needs to be adjusted wit the prefetch script)

# if not installed, do
sudo apt install vsearch

for f in file* ; do vsearch --fastq_filter $f --fastq_maxee 1.0 --fastqout $f\_filtered ; done

It is getting messy here (delete unwanted files)

rm *notCombined*
rm *hist
rm *histogram
rm *fastq

Tip

Too lazy to empty the trash yourself ? the terminal will do that for us gio trash --empty

Let’s convert fastq to fasta

# if not installed, do
sudo apt install seqtk

for f in *fastq_filtered ; do seqtk seq -a $f > $f.fasta ; done ;

Caution

Blast is commented out as it is unlikely to work - we could try with internal database one day

To save time let’s just use 10 reads from each file for taxonomic identification - we will use GenBank nt database

for f in *fastq_filtered.fasta ; do head -20 $f > $f\head ; done

Simplify the naming


# if not installed, do
sudo apt install rename

rename 's/.fastq_filtered.fastahead/.fasta/g' *fastq_filtered.fastahead ;

Blast

Let’s try - it may not work at all, first install

sudo apt install ncbi-blast+

blastn -query file1.fasta -db nt -remote -outfmt 6 -max_target_seqs 1 -perc_identity 95 > file1_blast.txt 

blastn -query file1.fasta -db nr -remote  > file1_blast2.txt

If local `blastn` does not run on your PC

Use the web-based blast, for example on the ncbi server.

Let’s create ASV table and visualise the data

cat file[0-9]*.fastq_filtered.fasta > all_samples.fasta

# if not installed, do
sudo apt install vsearch

vsearch --derep_fulllength all_samples.fasta   --output dereplicated.fasta  --sizeout  --relabel ASV_
vsearch --sortbysize dereplicated.fasta  --output dereplicated_no_singletons.fasta  --minsize 2
vsearch --cluster_unoise dereplicated_no_singletons.fasta --minsize 2  --unoise_alpha 2   --centroids ASVs.fasta
vsearch --usearch_global all_samples.fasta --db ASVs.fasta   --id 0.97     --otutabout ASV_table.txt

We will use MicrobiomeAnalyst - every tool needs the input in some specific format - here we need to change the table as well

sed 's/#OTU ID/#NAME/g' ASV_table.txt > ASV_table_MA.txt

We need some metadata (all invented here)

head -1 ASV_table_MA.txt | fmt -1 | sed 's/NAME/NAME, sample/g'  | sed 's/11$/11,fish/g' | sed 's/13$/13,fish/g' | sed 's/21$/21,fish/g' | sed 's/22$/22,fish/g' |sed 's/23$/23,not_a_fish/g' | sed 's/24$/24,not_a_fish/g' |sed 's/25$/25,not_a_fish/g' | sed 's/27$/27,not_a_fish/g' > metadata.csv

Visualise the data with python

Following this link you can interrogate the same table (ASV_table_MA.txt) in the interactive jupyter notebook.

To compute Braycurtis distances

To visualize the PCoA

Local python scripts

If you have python installed you can run scripts which are in this folder.

python append_braycurtis3.py ASV_table_MA.txt &&
python pcoa_with_metadata_quick_legend.py

Finally

The power of command line lies in the fact, that once you test your pipeline command by command, you can copy all the commands into a so-called bash script and rerun the whole thing at once.

Caution

You may want to create a separate new virtual environment (example uses conda) and install necessary dependencies first

# create virtual environment
conda create -n gen-test python

# activate it
conda activate gen-test

# install deps (pip or conda install)
pip install dask
pip install --upgrade --force-reinstall scipy dask
pip install pyarrow

python3 append_braycurtis3.py ASV_table_MA.txt

# dependencies for the second script
pip install numpy==1.24.3
pip install seaborn
pip install panel
pip install scikit-bio

python3 pcoa_with_metadata_quick_work.py

# make a new folder for a test
mkdir pipeline_test

# copy python scripts too, if you want to visualize the results.
cp *.py ./pipeline_test

# take the commands from the command list file
grep -A 1000 "genomics part" course_linux_commands | sed '/fastqc/d' | sed '/\*\*\*/d' > ./pipeline_test/my_first_pipeline.sh

Run the pipeline with

bash my_fist_pipeline.sh

Caution

You might need to make the script executable

chmod +x my_fist_pipeline.sh

# then you can run directly
./my_fist_pipeline.sh

Simple Linux – An Introduction to Genomic Analysis on Linux

Setup

Step 1: Install VirtualBox

Step 2: Create a New Virtual Machine for Ubuntu

Step 3: Configure and Install Ubuntu

Command line

Why

Basics

Searching text

Wild cards

Working with multiple files

Genomics

Word on E-Utilities

ESearch

Get the data

Raw Reads Manipulation

Blast

If local blastn does not run on your PC

Visualise the data with python

Local python scripts

Finally

If local `blastn` does not run on your PC