Intro to the UNIX environment

Session 4. File manipulation (GAWK scripting)

gawk program file

gawk '{STATEMENTS}' file

gawk 'BEGIN{STATEMENTS} {STATEMENTS} END {STATEMENTS}' file

Accessing fields:

$0 Entire line
$1 First field from current line
$2 Second field from current line
$n N-th field from current line
$NF Last field from current line

Built-in vars:

NF Number of fields in current line
NR Number of records (processed lines)
FS Field separator (default " ")
OFS Output field separator (default " ")
ORS Output record separator (between out lines, "\n")

Statements:

{ i = 0; }

{ i = (j * 2) + 3 ; }

Filters:

{ if (pattern) action; }

Patterns: a > b, a < b, a == b, a != b, a >= b, a <= b

Booleans: pattern1 && pattern2 , pattern1 || pattern2, !pattern

Practice 4. File manipulation (GAWK scripting)

% cd

% mkdir work4

% cd work4

Save this sequence (FASTA) file

Length of the sequence:
% grep -v ">" data/M23768.fa | fold -1 | wc

GC content:
% grep -v ">" data/M23768.fa | fold -1 | sort | uniq -c | gawk '{print $2,$1/500}' | grep [CG] | gawk 'BEGIN{i=0}{i=i+$2;} END {print i}'

Access to the UCSC human genome browser at http://genome.cse.ucsc.edu. Go to the section Downloads (left frame) and select Annotation database. Then, save the files refGene.txt.gz and refLink.txt.gz

% cd

% cd work4

Uncompress both files

% head -5 refGene.txt

% gawk '{print $0}' refGene.txt | more

Number of genes:
% gawk '{print $0}' refGene.txt | wc

Number of unique genes:
gawk '{print $1}' refGene.txt | sort | uniq | wc

Genes with more than one isoform:
% gawk '{print $1}' refGene.txt | sort | uniq -c | gawk '{if ($1 > 1) print $0;}' | sort -n

% gawk '{print "Gene",NR,"\t",$0}' refGene.txt | more

For each gene, print the ID (NM), chromosome location, number of exons (sort):
% gawk '{print $1,$2,$8}' refGene.txt | sort +2nr | more

The average number of exons:
% gawk 'BEGIN{l=0;} {l=l+$8;} END {print "AvExons =",l/NR}' refGene.txt

How many genes have more than 10 exons:
% gawk '{if ($8>10) print $0}' refGene.txt | wc

The average transcript length:
gawk 'BEGIN{l=0;} {l=l+($5-$4)} END {print "AvTLength =",l/NR}' refGene.txt

% head -5 refLink.txt

For each gene, print the name, the gene and protein IDs, the strand and the chromosome location:
% gawk 'BEGIN{OFS="\t";}{print $1,$2,$3,$8}' refGene.txt | sort > refGene.sort.txt
% gawk 'BEGIN{FS="\t";OFS="\t";}{print $3,$4,$1}' refLink.txt | sort > refLink.sort.txt
% head -5 *sort*
% wc *sort*
% join refGene.sort.txt refLink.sort.txt | gawk 'BEGIN{OFS="\t";}{print $6,$1,$5,$2,$3,$4}' | sort | more

Enrique Blanco © 2004 eblanco@imim.es

$0	Entire line
$1	First field from current line
$2	Second field from current line
$n	N-th field from current line
$NF	Last field from current line

NF	Number of fields in current line
NR	Number of records (processed lines)
FS	Field separator (default " ")
OFS	Output field separator (default " ")
ORS	Output record separator (between out lines, "\n")