Session 4. File manipulation (GAWK scripting)
gawk program file gawk '{STATEMENTS}' file gawk 'BEGIN{STATEMENTS} {STATEMENTS} END {STATEMENTS}' file
Accessing fields:
$0 Entire line $1 First field from current line $2 Second field from current line $n N-th field from current line $NF Last field from current line Built-in vars:
NF Number of fields in current line NR Number of records (processed lines) FS Field separator (default " ") OFS Output field separator (default " ") ORS Output record separator (between out lines, "\n") Statements:
- { i = 0; }
- { i = (j * 2) + 3 ; }
Filters:
- { if (pattern) action; }
- Patterns: a > b, a < b, a == b, a != b, a >= b, a <= b
- Booleans: pattern1 && pattern2 , pattern1 || pattern2, !pattern
Practice 4. File manipulation (GAWK scripting)
- % cd
- % mkdir work4
- % cd work4
- Save this sequence (FASTA) file
- Length of the sequence:
% grep -v ">" data/M23768.fa | fold -1 | wc
- GC content:
% grep -v ">" data/M23768.fa | fold -1 | sort | uniq -c | gawk '{print $2,$1/500}' | grep [CG] | gawk 'BEGIN{i=0}{i=i+$2;} END {print i}'
Access to the UCSC human genome browser at http://genome.cse.ucsc.edu. Go to the section Downloads (left frame) and select Annotation database. Then, save the files refGene.txt.gz and refLink.txt.gz
- % cd
- % cd work4
- Uncompress both files
- % head -5 refGene.txt
- % gawk '{print $0}' refGene.txt | more
- Number of genes:
% gawk '{print $0}' refGene.txt | wc
- Number of unique genes:
gawk '{print $1}' refGene.txt | sort | uniq | wc
- Genes with more than one isoform:
% gawk '{print $1}' refGene.txt | sort | uniq -c | gawk '{if ($1 > 1) print $0;}' | sort -n
- % gawk '{print "Gene",NR,"\t",$0}' refGene.txt | more
- For each gene, print the ID (NM), chromosome location, number of exons (sort):
% gawk '{print $1,$2,$8}' refGene.txt | sort +2nr | more
- The average number of exons:
% gawk 'BEGIN{l=0;} {l=l+$8;} END {print "AvExons =",l/NR}' refGene.txt
- How many genes have more than 10 exons:
% gawk '{if ($8>10) print $0}' refGene.txt | wc
- The average transcript length:
gawk 'BEGIN{l=0;} {l=l+($5-$4)} END {print "AvTLength =",l/NR}' refGene.txt
- % head -5 refLink.txt
- For each gene, print the name, the gene and protein IDs, the strand and the chromosome location:
% gawk 'BEGIN{OFS="\t";}{print $1,$2,$3,$8}' refGene.txt | sort > refGene.sort.txt% gawk 'BEGIN{FS="\t";OFS="\t";}{print $3,$4,$1}' refLink.txt | sort > refLink.sort.txt
% head -5 *sort*
% wc *sort*
% join refGene.sort.txt refLink.sort.txt | gawk 'BEGIN{OFS="\t";}{print $6,$1,$5,$2,$3,$4}' | sort | more
Enrique Blanco © 2004 eblanco@imim.es