In this summative assessment, you will be working with variant calls from a family trio, investigating the properties of the data and reporting statistics and insights on the variant calls. The variants have already been called, and a VCF header as well as a VCF file of a joint calling are provided. (As a reminder, the VCF file format is described here).

In our hypothetical scenario, the bioinformatician who initially performed the analysis has changed jobs without handing over the data properly, and the group of geneticists interested in the data has approached you for help: what is actually in this data? This kind of data archaeology is not uncommon. They noticed that the file is too big for them to investigate in Excel, but you can certainly help them! Additionally, they need to interpret some of the variants found.

For this task, you can freely choose which programming language or tools (or combination of them) you want to use, but your solution must also work for future files in an automated fashion (the group is sequencing more individuals), so Excel is not allowed (although it can be used to look at a small proportion of the data for orientation)!

Your Task

The VCF file contains a set of small nucleotide variants of three individuals (named N1, N2, and N3). For each task, you will first be presented with the questions outlining what the other scientists want to know; following this, you will find a step-by-step guide to assist you.

Collate all answers in a Word document and submit it below. Further details of what to include can be found in the guidelines.

Task 1: Quality control

- Which genome version and variant caller have been used in the analysis? Can the pipeline used be described using the information from the VCF file?
- How many and which types of SNVs have been detected with confidence?
- Is the provided data complete, or are there missing data (for example, chromosomes that have not been called)?

Task 1 step by step:

Step 1) Identify what analysis/pipeline and which version of the human genome have been used to detect SNVs.
For this task, interpret the header information of the VCF file example_header.vcf. You might need to run an internet search on parts of the header to identify each step in the analysis. Report each step in the analysis you could identify and comment on the completeness of the analysis, as well as on anything that remains unclear.
[100-150 words, 10%]

Step 2) How many SNVs have been detected? How many by variant type (InDels, single nucleotide substitutions)?
For this task, write a small script or use a tool to produce an overview of the events in the file. Break down the results for each individual in the file trio_example.vcf. Note that the VCF file is already filtered for high-confidence (PASS) variants, so you don't need to filter further.
[Answer as a small table, 2-3 sentences describing the observations, 10%] Please also provide your script in the solution.

Step 3) What is the frequency of SNVs for each chromosome, and is the analysis complete?
In this task, combine the length information for each chromosome and compute the frequency of variation per chromosome (number of variants / chromosome length), by variant type and by individual. Are there any outliers or missing chromosomes, given that you expect the variants to be uniformly distributed (each chromosome should have the same rate of variants per length)?
[Answer as a small table and a plot and two sentences of interpretation, 10%] Please also provide your script in the solution.

Task 2: Biological interpretation

- The samples are from a family trio, but the master Excel sheet containing the actual annotation for each sample has been lost. Could the annotation be inferred from the allele frequency of SNPs on the X chromosome?
- Can some de novo mutations be easily interpreted with public databases?

Task 2 step by step:

Step 4) Can the sex of each individual be verified (even if imperfectly)?
One scientist has the idea that variation on the X chromosome might differ from that on the autosomes (autosomes = all chromosomes except X and Y): there should be a loss of heterozygous instances on the X chromosome for male individuals. Analyse whether there is sufficient data for each individual in the trio (also considering the results from Step 3 of Task 1), and argue, based on the data found as well as from a biological/bioinformatics perspective, whether such a call is possible. To do so, you can, for example, compute the ratio of homozygous to heterozygous SNPs for the X chromosome and for the autosomes, and compare the values within and between the samples when discussing the data.
[Answer as a small table and a plot and 2-5 sentences of interpretation, 10%] Please also provide your script in the solution.

Step 5) How many private mutations are there for each individual (mutations that are not shared)? (These would be de novo mutation candidates in the case of the child.) Does the data allow you to identify the child, and how?
Write a small script or use appropriate tools to extract the number and types of de novo mutations for each individual in the file.
[100-150 words, 10%] Please also provide your script in the solution.

Step 6) Interpret the mutations provided in the example_variance.vcf file using public databases. Reflect on your findings.
[100-150 words, 10%]

Further information

Please copy all the following files into your home directory on BLUEBEAR.

You can access the VCF file header for the first question here:
- example_header.vcf

A file containing the variant calling for three individuals (father, mother, child) can be found here:
- trio_example.vcf

Another file containing the lengths of each chromosome can be found here:
- chromosome_lengths.tsv

Example variants to interpret biologically in Step 6 can be found here:
- example_variance.vcf

In this assessment, you are not restricted in how you write and structure your code or in which programming tool you use to achieve the results (other than Excel!). However, as an indication that the solution is your own, you will need to submit your code alongside the solution (answers) to the questions. Nevertheless, it is the answers to the questions, rather than the code itself, that will be checked for accuracy. Still, do try to use the experience you've gained in the course so far to write code that is as readable and efficient as you can!

Guidelines

- Please note that the VCF files for this assignment will be provided by Week 8.
- This is a graded assignment contributing to 60% of your overall module grade.
- After analysis, submit for each question an informed answer of 1-5 sentences/50-200 words (including the relevant findings and numbers where appropriate). You may use graphical representations/plots if you find them useful to answer a question.
- In addition to your answers, submit your tool commands, scripts or code notebook (depending on the programming language or tools or combination of them you have used).
- Complete all tasks before submitting your assignment.
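
As an orientation for Step 2, the per-individual overview can be sketched in plain Python as below. This is only one possible approach, not a reference solution. The function names classify and count_by_sample are made up for illustration, and the sketch assumes that each record has a GT entry in its FORMAT column, that an individual "has" a variant when its genotype contains a non-reference allele, and that anything that is neither a single-base substitution nor a length change is counted as "other".

```python
from collections import Counter, defaultdict


def classify(ref, alt):
    """Label one REF/ALT pair (illustrative rule, see assumptions above)."""
    if alt == "*" or alt.startswith("<"):
        return "other"  # spanning deletion or symbolic allele
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) != len(alt):
        return "InDel"
    return "other"  # e.g. multi-nucleotide substitution


def count_by_sample(lines):
    """Return {sample: Counter({type: n})} over non-reference genotypes."""
    counts = defaultdict(Counter)
    samples = []
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue  # meta-information or blank line
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        ref, alts = fields[3], fields[4].split(",")
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            genotype = call.split(":")[gt_index].replace("|", "/")
            # Alternate alleles carried by this sample (skip ref and missing).
            carried = {a for a in genotype.split("/") if a not in (".", "0")}
            for allele in carried:
                counts[sample][classify(ref, alts[int(allele) - 1])] += 1
    return counts
```

With the real data, this could be called as count_by_sample(open("trio_example.vcf")), and the resulting counters can be written out as the small table requested in Step 2. Whether multi-allelic sites should count once per site or once per allele is a design choice that should be stated alongside the results.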
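
The rate in Step 3 (number of variants / chromosome length) is a simple division once the counts per chromosome are known. The sketch below shows the arithmetic only. The names read_lengths, variant_rates and missing_chromosomes are hypothetical, and the layout of chromosome_lengths.tsv (chromosome name in the first column, length in the second, with an optional header row) is an assumption that should be checked against the actual file.

```python
def read_lengths(lines):
    """Parse 'name<TAB>length' rows; rows without an integer length are skipped."""
    lengths = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit():
            lengths[parts[0]] = int(parts[1])
    return lengths


def variant_rates(counts, lengths):
    """Variants per base for every chromosome listed in lengths.

    counts maps chromosome name to a number of variants; chromosomes
    absent from counts get a rate of 0.0 rather than being dropped.
    """
    return {chrom: counts.get(chrom, 0) / length
            for chrom, length in lengths.items()}


def missing_chromosomes(counts, lengths):
    """Chromosomes that have a known length but no variants at all."""
    return sorted(chrom for chrom in lengths if counts.get(chrom, 0) == 0)
```

Note that chromosome names must match between the two files (for example "chr1" versus "1"); a mismatch would make every chromosome look as if it were missing. The same functions can be applied once per individual and per variant type.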
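
For Step 4, the ratio of homozygous to heterozygous SNPs on the X chromosome and on the autosomes could be computed along the following lines. Again, this is a sketch under stated assumptions rather than a definitive method: the function name hom_het_ratios is hypothetical, autosomes are taken to be chromosomes with purely numeric names (after removing an optional "chr" prefix), only single-base REF/ALT records are counted as SNPs, and homozygous-reference or missing genotypes are ignored.

```python
from collections import defaultdict


def hom_het_ratios(lines):
    """Return {sample: {"X": ratio, "autosomes": ratio}}.

    The ratio is hom/het over non-reference SNP genotypes and is None
    when a sample has no heterozygous calls in that region.
    """
    tally = defaultdict(lambda: defaultdict(lambda: {"hom": 0, "het": 0}))
    samples = []
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        chrom = fields[0][3:] if fields[0].startswith("chr") else fields[0]
        if chrom == "X":
            region = "X"
        elif chrom.isdigit():
            region = "autosomes"
        else:
            continue  # Y, mitochondrial DNA, unplaced contigs
        ref, alts = fields[3], fields[4].split(",")
        if len(ref) != 1 or any(len(alt) != 1 or alt == "*" for alt in alts):
            continue  # keep single-base substitutions only
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            alleles = call.split(":")[gt_index].replace("|", "/").split("/")
            if "." in alleles or set(alleles) == {"0"}:
                continue  # missing or homozygous reference
            kind = "hom" if len(set(alleles)) == 1 else "het"
            tally[sample][region][kind] += 1
    result = {}
    for sample in samples:
        result[sample] = {}
        for region in ("X", "autosomes"):
            c = tally[sample][region]
            result[sample][region] = c["hom"] / c["het"] if c["het"] else None
    return result
```

A markedly higher X ratio than autosomal ratio in one individual would be consistent with a single X chromosome, but residual heterozygous calls (for example in pseudoautosomal regions or from mapping artefacts) are to be expected, which is why the task asks for a discussion rather than a hard call.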
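
For Step 5, one simple way to count private mutations is to ask, for every alternate allele, which individuals carry it. The sketch below uses a hypothetical function private_counts and assumes that an allele is "private" when exactly one individual carries it and the other individuals have a called (non-missing) genotype at that site; sites with any missing genotype are skipped, which is a conservative choice and not the only reasonable one.

```python
from collections import Counter, defaultdict


def private_counts(lines):
    """Return Counter({sample: number of alternate alleles seen only in it})."""
    counts = Counter()
    samples = []
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        gt_index = fields[8].split(":").index("GT")
        carriers = defaultdict(set)  # alternate allele index -> samples
        complete = True
        for sample, call in zip(samples, fields[9:]):
            alleles = call.split(":")[gt_index].replace("|", "/").split("/")
            if "." in alleles:
                complete = False  # cannot tell whether the allele is shared
                break
            for allele in set(alleles) - {"0"}:
                carriers[allele].add(sample)
        if not complete:
            continue
        for who in carriers.values():
            if len(who) == 1:
                counts[next(iter(who))] += 1
    return counts
```

In a true trio, the child is expected to have very few private variants, since nearly all of its alleles are inherited from a parent, whereas each parent carries alleles that were not transmitted. Private calls in the child are therefore only de novo candidates and may also reflect genotyping errors.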