This is a timed assessment on Unix for bioinformatics.
In each case, write down the exact lines of code that you used to
answer the question, including all intermediate steps. Below each
line of code, write an explanation that describes what it does and
the rationale for using it. Some of the questions can be answered
with material learned directly in lectures and workshops; other
parts may require additional research into which approach best
answers the question.
Please download the data file “assessment1.zip” and unzip it to
obtain the data directory.
Simple Example Question:
Calculate the number of lines in the file “example.txt”
Example Answer:
wc -l example.txt
206
The ‘wc’ command reports counts for the example.txt file (lines,
words and characters), and the ‘-l’ option restricts the output to
the number of lines in this file, which is what was requested in the
question.
UNIX
1. List five Unix commands and describe what each command does,
giving an example of its use in each case.
– Any five commands with description (1 mark each) and reasonable
example use (1 mark each) – 10 marks total
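One possible set of five commands is sketched below: mkdir, cp, ls, grep and head. The choice of commands is only an illustration, and the file and directory names (genes.txt, results) are hypothetical and created in a temporary sandbox so the lines can be run safely.

```shell
# Work in a throwaway directory; genes.txt and results are invented names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'geneA\ngeneB\ngeneC\n' > genes.txt

mkdir results               # mkdir: create a new directory
cp genes.txt results/       # cp: copy a file into that directory
ls results                  # ls: list the contents of a directory
grep -c 'gene' genes.txt    # grep: search for a pattern (-c counts matching lines)
head -n 2 genes.txt         # head: print the first lines of a file (here, two)
```

Each command is paired with a one-line description and a concrete use, which is the shape the marking note asks for.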
2. Within the data directory, the second largest file contains
information on genetic variants that have been identified for a
group of human individuals. A collaborator is interested in
information about how the data was generated (contained within the
header of the file, lines that start with #), and also wants to see
some examples of genetic variant calls. Identify this file from the
directory and produce a new file from this that contains the header
and information on the last five genetic variants. Move this file to
a new directory that you call ‘CollaboratorA’.
– The answer needs to list and sort the files by size to identify
the second largest file, use grep to write the lines of that file
that start with a hash to a new file, use tail to append the last
five lines to this new file, create a new directory, and move the
file into it (five steps, five explanations – 10 marks total).
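The five steps can be sketched as follows. This is a minimal illustration, not the model answer: the real file names inside assessment1.zip are not given here, so the sandbox below invents its own (largest.txt, variants.vcf, small.txt, subset.vcf), and it assumes the header lines all sit at the top of the variant file.

```shell
# Sandbox with invented files so the steps can be run end to end.
# All file names below are hypothetical, not those in assessment1.zip.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir data
seq 1 2000 > data/largest.txt
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\n' > data/variants.vcf
for i in 1 2 3 4 5 6 7 8; do
  printf 'chr1\t%s\trs%s\n' "$((i * 100))" "$i" >> data/variants.vcf
done
echo tiny > data/small.txt

# Step 1: show the files sorted by size (largest first) for inspection,
# then take the second name from the size-sorted listing.
ls -lS data
second=$(ls -S data | sed -n '2p')

# Step 2: write the header (lines starting with #) to a new file.
grep '^#' "data/$second" > subset.vcf

# Step 3: append the last five lines, i.e. the last five variant calls.
tail -n 5 "data/$second" >> subset.vcf

# Steps 4 and 5: create the directory and move the new file into it.
mkdir CollaboratorA
mv subset.vcf CollaboratorA/
```

Using > in step 2 creates the file and >> in step 3 appends to it, so the header stays above the variant lines.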
3. There are three text files labelled “Group*.txt” that you wish
to share with a collaborator as a single file, but you will have to
remove the date of birth information, as this is personal
information and you do not have permission to share it. Individual
14 has also withdrawn from the study, so you should not share data
for this individual. Create this file, maintaining a single header
at the top and including all permitted individuals across the three
groups.
– The answer needs to create a new file containing the header
without the DOB label, then go through the files (in a loop or one
at a time) and add the information from the non-header lines,
excluding the DOB column, and then remove Individual 14. (1 mark for
the header, 1 mark for removing DOB from the header, 1 mark for
concatenating all the information, 1 mark for doing this while
correctly removing DOB, 1 mark for removing Individual 14, plus one
mark in each of these cases for a proper description: 10 marks
total).
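A hedged sketch of one way to do this is shown below. The layout of the real Group*.txt files is not described here, so the example assumes tab-separated columns with the individual's ID in column 1 and DOB in column 2, and assumes the withdrawn individual is labelled Ind14; the sample data and the output name shared.txt are invented for illustration.

```shell
# Invented sample data: tab-separated, ID in column 1, DOB in column 2.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'ID\tDOB\tHeight\nInd1\t1980-02-03\t170\nInd2\t1975-11-30\t165\n' > Group1.txt
printf 'ID\tDOB\tHeight\nInd14\t1990-06-15\t180\nInd15\t1988-01-09\t158\n' > Group2.txt
printf 'ID\tDOB\tHeight\nInd140\t1969-07-20\t175\n' > Group3.txt

# Header: take the first line of one file and drop column 2 (DOB).
head -n 1 Group1.txt | cut -f 1,3- > shared.txt

# Body: for each file, skip its header line, drop the DOB column,
# and keep every row whose ID is not exactly Ind14.
for f in Group*.txt; do
  tail -n +2 "$f" | cut -f 1,3-
done | awk -F'\t' '$1 != "Ind14"' >> shared.txt

cat shared.txt
```

The exact comparison on column 1 is a deliberate choice: a plain grep -v 'Ind14' would also discard an individual such as Ind140. The output file name must not match Group*.txt, otherwise the loop would read its own output.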
4. Within the directory there is a “.sam” file containing
information from a sequencing experiment in SAM format. The first
column of a SAM file contains the read name and subsequent columns
give information about where the read aligns to a reference genome.
Each sequencing read can align to multiple different locations and
thus have multiple different entries (on different lines) in the SAM
file. Count the total number of lines in the file, and then count
the number of unique read names in the file (so if a read name
occurs more than once on different lines, you should only count it
once).
– The answer needs to count the total number of entries (1 mark,
plus 1 mark for the explanation), select the first column (1 mark,
explain 1 mark), sort the names (1 mark, explain 1 mark), identify
the unique names (1 mark, explain 1 mark), then count the lines
(1 mark, explain 1 mark) – 10 marks total.
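The counting steps can be sketched as below. The name of the real “.sam” file is not given here, so reads.sam and its contents are hypothetical; the sample has five alignment lines covering three distinct read names and no header lines.

```shell
# Invented SAM-like sample: tab-separated, read name in column 1.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'read1\t0\tchr1\t100\n'   >  reads.sam
printf 'read2\t0\tchr1\t200\n'   >> reads.sam
printf 'read1\t256\tchr2\t50\n'  >> reads.sam
printf 'read3\t16\tchr3\t10\n'   >> reads.sam
printf 'read2\t256\tchr5\t70\n'  >> reads.sam

# Total number of lines in the file.
total=$(wc -l < reads.sam)

# Unique read names: take column 1, sort so that duplicates become
# adjacent, collapse adjacent duplicates with uniq, then count lines.
unique=$(cut -f 1 reads.sam | sort | uniq | wc -l)

echo "total lines: $total"
echo "unique read names: $unique"
```

The sort step matters because uniq only collapses duplicates that are adjacent; sort -u would combine the two steps. If the real file contains header lines beginning with @, they would be counted as read names unless they are filtered out first (for example with grep -v '^@'); whether to do so depends on how the question is read.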