Summary of Chapter 1
- The shell: the “Swiss Army Knife” of programmers
- Take a peek at the data; Perform simple data manipulations; Automation; Glue
- Hundreds of small programs, each carrying out a specific task
- “Worse is better”
- Connect input and output using “pipe” | and redirection > (or >>)
Navigating the directory system
- To find out where you are:
pwd
(print working directory)
- Move one level up with
cd ..
- Move down using
cd wheretogo
- Absolute vs. relative path:
cd /home/sallesina/CSB/unix
vs.
cd ../CSB/unix
- Home directory
~
- Go back to where you were last
cd -
- List files and directories
ls
- Command modifiers
ls -l
(long format)
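The navigation commands above can be tried safely in a throwaway directory tree. This is only a minimal sketch: the directory names (CSB/unix, CSB/python) and the variables base, here, and back are illustrative assumptions, not part of the course material.

```shell
# Build a small, disposable directory tree to practice in
# (directory names are illustrative)
base=$(mktemp -d)
mkdir -p "$base/CSB/unix" "$base/CSB/python"

cd "$base/CSB/unix"       # absolute path: starts from the root /
cd ../python              # relative path: up one level, then down into python
here=$(basename "$PWD")   # name of the directory we are in now

cd - > /dev/null          # go back to the previous directory (CSB/unix)
back=$(basename "$PWD")

ls -l "$base/CSB"         # long listing of the content of CSB
echo "visited: $here, returned to: $back"
```

Note that cd - prints the directory it moves to, which is why its output is discarded here.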
Working with files and directories
- Copy
cp myfile newloc
- Move (or rename)
mv myfile newloc
- Update a file's modification time
touch myfile
(if the file does not exist, it will be created)
- Remove file
rm myfile
- Remove (empty) directory
rmdir mydir
- Remove (non-empty) directory
rm -r mydir
BEWARE!
- First few lines
head myfile
- Last few lines
tail -n3 myfile
(last three lines)
- All but a few
tail -n+2 myfile
(all but the first line)
- Concatenate file(s) (or print to screen)
cat myfile1 myfile2
- Count lines/words/bytes
wc myfile
- Count lines only
wc -l myfile
- Sort lines alphabetically
sort myfile
- Sort numerically
sort -n myfile
- Unique lines
uniq myfile
- Unique lines with number of occurrences
uniq -c myfile
The file needs to be sorted first: uniq only merges adjacent duplicate lines.
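To see several of these commands working together, here is a small sketch on a made-up file. The file name animals.txt, its contents, and the variable names are assumptions chosen for illustration.

```shell
# Create a small example file in a temporary directory
# (file name and contents are made up for illustration)
work=$(mktemp -d)
cd "$work"
printf 'species\nwolf\nbear\nwolf\nlynx\nbear\nwolf\n' > animals.txt

head -n 2 animals.txt             # first two lines (header and first record)
tail -n +2 animals.txt | wc -l    # number of lines after the header

# uniq only merges adjacent duplicates, so sort first;
# the final sort -rn puts the most frequent value on top
counts=$(tail -n +2 animals.txt | sort | uniq -c | sort -rn)
echo "$counts"

cp animals.txt backup.txt         # copy
mv backup.txt old_animals.txt     # rename
rm old_animals.txt                # remove (there is no undo)
```

Running uniq -c without the preceding sort would report wolf three separate times, once for each run of adjacent identical lines.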
Pipes and redirection
- Send text to a file
echo "my first line" > myfile.txt
- Append text to a file
echo "my second line" >> myfile.txt
- Pipe operator: take whatever is on the left, and use as input for the command on the right.
ls -l | wc -l
(count files and sub-directories in the current directory; ls -l usually prints a leading “total” line, so the count is one too high)
sort myfile | uniq > no_duplicates.txt
(sort and remove duplicates)
- Can become complex:
history | awk '{print $2}' | sort | uniq -c | sort -rn | head -10
prints 10 most used commands
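A minimal sketch of redirection and pipes on a scratch file follows. The file name fruits.txt, its contents, and the variable n_unique are illustrative assumptions.

```shell
work=$(mktemp -d)
cd "$work"

echo "banana" > fruits.txt      # > creates (or overwrites) the file
echo "apple" >> fruits.txt      # >> appends to it
echo "banana" >> fruits.txt

# Pipe: sort the lines, drop duplicates, and save the result
sort fruits.txt | uniq > no_duplicates.txt

# Count the distinct lines; tr removes the padding some versions of wc add
n_unique=$(wc -l < no_duplicates.txt | tr -d ' ')
echo "distinct lines: $n_unique"
```

Using > a second time on fruits.txt would have discarded the first line, which is the usual reason to prefer >> when building a file step by step.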
Delimited text files
- Very common format to store data
- Delimiter can be comma, Tab, semicolon, or another special character
- You can select certain columns using cut
cut -d ';' -f3-5 mycsv.csv
: extract columns 3, 4, and 5 from mycsv.csv, which is a semicolon-separated file (the semicolon must be quoted, otherwise the shell reads it as the end of the command)
cut -f3,6,9 myspace.tsv
: by default, cut assumes Tab-separated columns; select non-adjacent columns
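The sketch below shows cut on two tiny made-up files, one semicolon-separated and one Tab-separated. File names, contents, and variable names are assumptions for illustration. Note that the delimiter of cut defaults to Tab, and that a semicolon has to be quoted on the command line.

```shell
work=$(mktemp -d)
cd "$work"

# Semicolon-separated file: quote the delimiter,
# otherwise the shell treats ; as the end of the command
printf 'id;name;mass;length;width\n1;wolf;40;150;60\n' > mycsv.csv
cols_csv=$(cut -d ';' -f 3-5 mycsv.csv)
echo "$cols_csv"

# Tab-separated file: Tab is the default delimiter, so -d is not needed
printf 'a\tb\tc\td\n1\t2\t3\t4\n' > mytable.tsv
cols_tsv=$(cut -f 1,3 mytable.tsv)
echo "$cols_tsv"
```

The selected columns keep the delimiter of the input file, so the first result is still semicolon-separated.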
Help with commands
- Very long manual:
man cut
To exit, press q
Substituting characters
echo "aaabbbccc" | tr "a" "b"
changes a into b
echo "aaabbbccc" | tr -d "a"
removes all occurrences of a
echo "aaabbbccc" | tr -s "a"
squeezes each run of repeated a into a single a
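For reference, here is what the three tr commands produce, plus one further use (an assumption of this sketch, not from the chapter): turning a comma-separated line into one item per line. The variable names are illustrative.

```shell
replaced=$(echo "aaabbbccc" | tr "a" "b")    # bbbbbbccc
deleted=$(echo "aaabbbccc" | tr -d "a")      # bbbccc
squeezed=$(echo "aaabbbccc" | tr -s "a")     # abbbccc

# Replace each comma with a newline: one item per line
items=$(echo "EX,RE,CR" | tr "," "\n")

echo "$replaced $deleted $squeezed"
echo "$items"
```

tr reads only from standard input, which is why it always appears after a pipe or with an input redirection.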
Warmup: Nobel data
The file data/nobel.csv contains the names and motivations for all the Nobel prizes awarded so far.
- Find the number of winners for each Nobel prize (chemistry, economics, literature, medicine, peace, physics).
Solution
tail -n+2 nobel.csv | cut -f3 -d, | sort | uniq -c | sort -nr
- Find the winners of multiple Nobel prizes
Solution
cut -d, -f5-6 nobel.csv | sort | uniq -c | sort -n -r | head -n10
- Find the most common surnames among the winners
Solution
cut -d, -f6 nobel.csv | sort | uniq -c | sort -nr | head -n10
- The Nobel prizes have not been awarded every year since 1901. Which one has been awarded the most? Which the least?
Solution
tail -n+2 nobel.csv | cut -d, -f 2-3 | sort | uniq | cut -d, -f2 | sort | uniq -c | sort -nr
Matching lines with grep
grep is based on regular expressions, which we will explore in Chapter 5. However, it is quite useful to work with literal matches (i.e., does the string appear in a line?). We will revisit this once we’re familiar with regex, to see how powerful it can be.
grep "a string" a_file.txt
return the lines in a_file.txt that contain a string
grep -c "a string" a_file.txt
just count how many lines match
grep -w myword a_file.txt
just match entire words (i.e., flanked by spaces, punctuation, etc. on the left and right)
grep -i MyWoRd a_file.txt
ignore case
grep -n myword a_file.txt
also print line numbers
grep -v myword a_file.txt
invert match: only return lines that do not contain the pattern
grep -o myword a_file.txt
only print the matched string (not the entire line)
grep -A 1 myword a_file.txt
also print the following line (A stands for After)
grep -B 3 myword a_file.txt
also print the preceding three lines (B stands for Before)
grep "pattern_1\|pattern_2\|pattern_3" a_file.txt
match one of several patterns
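The difference between these options is easiest to see on a tiny file. In this sketch the file contents and variable names are made up for illustration. For matching several patterns it uses repeated -e options, a portable alternative to the \| form shown above.

```shell
work=$(mktemp -d)
cd "$work"
printf 'the cat sat\nconcatenate files\nThe Cat left\na dog barked\n' > a_file.txt

n_match=$(grep -c "cat" a_file.txt)          # 2: "cat" also matches inside "concatenate"
n_word=$(grep -c -w "cat" a_file.txt)        # 1: whole word only
n_nocase=$(grep -c -i -w "cat" a_file.txt)   # 2: whole word, ignoring case
others=$(grep -v -i "cat" a_file.txt)        # the only line without "cat" in any case

# Several patterns: one -e per pattern
n_either=$(grep -c -e "dog" -e "sat" a_file.txt)   # 2

echo "$n_match $n_word $n_nocase $n_either"
echo "$others"
```

The jump from 2 to 1 between the first two counts is the typical reason to reach for -w when searching for short words.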
Warmup: Endangered species
Data taken from the European Red List
Species codes:
- EX Extinct
- RE Regionally Extinct
- CR Critically Endangered (= threatened species)
- EN Endangered (= threatened species)
- VU Vulnerable (= threatened species)
- NT Near Threatened
- LC Least Concern
- DD Data Deficient
- NA Not Applicable
- Look at the first few lines to familiarize yourself with the data
Solution
head European_Red_List.csv
- Count the number of occurrences for each category (EX, RE, etc.)
Solution
tail -n+2 European_Red_List.csv | cut -d, -f10 | sort | uniq -c
- Repeat, but only consider birds (class AVES)
Solution
grep -w AVES European_Red_List.csv | cut -d, -f10 | sort | uniq -c | sort -nr
- For each order of birds, compute the number of extinct or nearly extinct (EX, RE, or CR) species
Solution
grep AVES European_Red_List.csv | grep -w "CR\|EX\|RE" | cut -d, -f5,10 | sort | uniq -c
Simple scripts
You can collect a series of commands into a script. Moreover, you can introduce generic arguments, making your script more general. For example, what does this program do?
#!/bin/bash
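# the previous line is not a comment: it tells the system where to find the bash program
# $1 is the first argument passed to the script (here, a file name)
cat "$1" | sort | uniq > "nodup_$1"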
- Save the script as
sandbox/script1.sh
- Make executable
chmod +rx script1.sh
- Try running
./script1.sh test_file.txt
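The three steps can be reproduced end to end in a scratch directory. This is a sketch under a few assumptions: the contents of test_file.txt are made up, the script is written with a here-document instead of an editor, and the temporary directory must allow executing files.

```shell
work=$(mktemp -d)
cd "$work"

# Write the script to disk (same pipeline as above, with the argument quoted)
cat > script1.sh << 'EOF'
#!/bin/bash
# $1 is the first argument passed to the script
cat "$1" | sort | uniq > "nodup_$1"
EOF
chmod +rx script1.sh      # make it readable and executable

# A test file with duplicated lines (contents are illustrative)
printf 'b\na\nb\nc\na\n' > test_file.txt

./script1.sh test_file.txt
cat nodup_test_file.txt   # sorted lines, duplicates removed
```

The script writes its result to a new file whose name is the input name prefixed by nodup_, so the original file is left untouched.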
A more complex example:
#!/bin/bash
# launch R code several times,
# each time using a different seed for the random number generator
for i in $(seq 1 "$2")
do
    echo "Launching program $1 with seed $i"
    Rscript "$1" "$i"
done
- Save the script as
sandbox/script2.sh
- Make executable
chmod +rx script2.sh
- Launch
./script2.sh a_test.R 7
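Before launching many runs, it can help to check what the loop would do. The following sketch is a dry-run variant that only prints the commands instead of executing them, so it works even without R installed. The function launch_dry_run and the variable planned are hypothetical names introduced here, not part of the course scripts.

```shell
# Dry-run version of the loop: print each command instead of running it
# (launch_dry_run is a hypothetical helper for illustration)
launch_dry_run() {
    program=$1      # script to launch, e.g. a_test.R
    n_runs=$2       # number of repetitions, also used as the last seed
    for i in $(seq 1 "$n_runs"); do
        echo "Rscript $program $i"
    done
}

planned=$(launch_dry_run a_test.R 3)
echo "$planned"
```

Inside a function, $1 and $2 refer to the function's own arguments, in the same way that they refer to the command-line arguments inside a script.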
More resources on bash
- A good collection of very simple scripts, showing you the type of things you can do.
- Many good resources here