2015-06-13

removed extra spaces

New page

{{Dahlquist}}

__NOTOC__

<div style="float: right;">__TOC__</div>

This is based on the most current version of the data analysis protocol for the Dahlquist Lab microarray data, which can be found on the [[Dahlquist:Microarray_Data_Analysis_Workflow | Microarray Data Analysis Workflow page]].

'''Summary of steps for microarray data analysis'''

#Quantitate the fluorescence signal in each spot (GenePix Pro)

#Calculate the ratio of red/green fluorescence (GenePix Pro)

#Log transform the ratios (GenePix Pro)

#Normalize the ratios on each microarray slide (within-chip normalization)

#Normalize the ratios for a set of slides in an experiment (between-chip normalization)

#Perform statistical analysis on the ratios

#* Within-strain ANOVA

#* Modified t test for each timepoint

#* Between-strain ANOVA

#* Benjamini & Hochberg and Bonferroni p value corrections for the above three tests

#* "Sanity Check" on above three tests

#Pattern finding algorithms (clustering with stem)

#Gene Ontology term enrichment analysis (on clusters with stem or on gene sets with MAPPFinder)

#Pathway analysis (GenMAPP)

#Determining candidate transcription factors and gene regulatory network (YEASTRACT)

#Dynamical modeling with [http://kdahlquist.github.io/GRNmap/ GRNmap]; visualization with [http://dondi.github.io/GRNsight/ GRNsight]

This page has instructions for steps 7,8, and 10.

== Before you begin... ==

You will record all of the manipulations of the data in an electronic lab notebook stored on this wiki. Please see the [[Dahlquist:Wiki_Checklist | Wiki Checklist]] page for more details on how to do this.

=== Viewing File Extensions ===

* The Windows 7 operating systems defaults to hiding file extensions. To turn them back on, do the following: [[Image:FolderOptions.jpg|right|Folder Options window]]

*# Go to the Start menu and select "Control Panel".

*# In the window that appears, search for "Folder Options" in the search field in the upper right hand corner.

*# Click on "Folder Options" in the main window.

*# When the Folder Options window appears, click on the View tab.

*# Uncheck the box for "Hide extensions for known file types".

*# Click the OK button.

* The computers in Seaver 120 are are set to erase all custom user settings and restore the defaults once they have been restarted, so you will probably have to do this many times throughout the semester when using these computers.

=== Set Your Browser to Prompt You for the Location to Save your Downloaded Files ===

* In Mozilla Firefox, open the Options window.

** Select the radio button that says "Always ask me where to save files".

** You could also change the default "Save files to" location to your Desktop, so that will be the first choice when it prompts you where to save the file. (You will have to temporarily deselect the radio button to do this and then reselect it when you are done.

** Click OK to save your changes.

* In Google Chrome, open the Settings window.

** Click on the link at the bottom of the page that says "Advanced Settings".

** Check the box that says "Ask where to save each file before downloading".

** You could also change the default Download location to your Desktop, so that will be the first choice when it prompts you where to save the file.

** Your settings are automatically saved.

== Step 7-8: Clustering and GO Term Enrichment with stem ==

# '''Prepare your microarray data file for loading into STEM.'''

#* Insert a new worksheet into your Excel workbook, and name it "(STRAIN)_stem".

#* Select all of the data from your "(STRAIN)_ANOVA" worksheet and Paste special > paste values into your "(STRAIN)_stem" worksheet.

#** Your leftmost column should have the column header "MasterIndex". Rename this column to "SPOT". Column B should be named "ID". Rename this column to "Gene Symbol". Delete the column named "StandardName".

#** Filter the data on the B-H corrected p value to be > 0.05 (that's '''greater than''' in this case).

#*** Once the data has been filtered, select all of the rows (except for your header row) and delete the rows by right-clicking and choosing "Delete Row" from the context menu. Undo the filter. This ensures that we will cluster only the genes with a "significant" change in expression and not the noise.

#** Delete all of the data columns '''''EXCEPT''''' for the Average Log Fold change columns for each timepoint (for example, wt_AvgLogFC_t15, etc.).

#** Rename the data columns with just the time and units (for example, 15m, 30m, etc.).

#** Save your work. Then use ''Save As'' to save this spreadsheet as Text (Tab-delimited) (*.txt). Click OK to the warnings and close your file.

#*** Note that you should turn on the file extensions if you have not already done so.

# '''Now download and extract the STEM software.''' [http://www.cs.cmu.edu/~jernst/stem/ Click here to go to the STEM web site].

#* Click on the [http://www.andrew.cmu.edu/user/zivbj/stemreg.html download link], register, and download the <code>stem.zip</code> file to your Desktop.

#* Unzip the file. In Seaver 120, you can right click on the file icon and select the menu item ''7-zip > Extract Here''.

#* This will create a folder called <code>stem</code>. Inside the folder, double-click on the <code>stem.jar</code> to launch the STEM program.

<!--#** In Seaver 120, we encountered an issue where the program would not launch on the Windows XP machines due to a lack of memory. (Even though the computers have been upgraded to Windows 7, do this to launch the program.) To get around this problem, launch STEM from the command line.

#*** Go to the start menu and click on ''Programs > Accessories > Command Prompt''.

#*** You will need to navigate to the directory (folder) in which the STEM program resides. If you followed the instructions above and extracted the stem folder to the Desktop, type the following: <code>cd Desktop\stem</code> and press "Enter".

#*** To launch the program then type: <code>java -mx512M -jar stem.jar -d defaults.txt</code> and press "Enter". This will launch the program with less memory allocated to it.-->

# '''Running STEM'''

## In section 1 (Expression Data Info) of the the main STEM interface window, click on the ''Browse...'' button to navigate to and select your file.

##* Click on the radio button ''No normalization/add 0''.

##* Check the box next to ''Spot IDs included in the data file''.

## In section 2 (Gene Info) of the main STEM interface window, select ''Saccharomyces cerevisiae (SGD)'', from the drop-down menu for Gene Annotation Source. Select ''No cross references'', from the Cross Reference Source drop-down menu. Select ''No Gene Locations'' from the Gene Location Source drop-down menu.

## In section 3 (Options) of the main STEM interface window, make sure that the Clustering Method says "STEM Clustering Method" and do not change the defaults for Maximum Number of Model Profiles or Maximum Unit Change in Model Profiles between Time Points.

## In section 4 (Execute) click on the yellow Execute button to run STEM.

# '''Viewing and Saving STEM Results'''

## A new window will open called "All STEM Profiles (1)". Each box corresponds to a model expression profile. Colored profiles have a statistically significant number of genes assigned; they are arranged in order from most to least significant p value. Profiles with the same color belong to the same cluster of profiles. The number in each box is simply an ID number for the profile.

##* Click on the button that says "Interface Options...". At the bottom of the Interface Options window that appears below where it says "X-axis scale should be:", click on the radio button that says "Based on real time". Then close the Interface Options window.

##*Take a screenshot of this window (on a PC, simultaneously press the <code>Alt</code> and <code>PrintScreen</code> buttons to save the view in the active window to the clipboard) and paste it into a PowerPoint presentation to save your figures.

## Click on each of the SIGNIFICANT profiles (the colored ones) to open a window showing a more detailed plot containing all of the genes in that profile.

##* Take a screenshot of each of the individual profile windows and save the images in your PowerPoint presentation.

##* At the bottom of each profile window, there are two yellow buttons "Profile Gene Table" and "Profile GO Table". For each of the profiles, click on the "Profile Gene Table" button to see the list of genes belonging to the profile. In the window that appears, click on the "Save Table" button and save the file to your desktop. Make your filename descriptive of the contents, e.g. "wt_profile#_genelist.txt", where you replace the number symbol with the actual profile number.

##** Upload these files to [http://lionshare.lmu.edu LionShare] and e-mail a link to Dr. Dahlquist. (It will be easier to [[BIOL398-04/S15:Help#Compressing_Files_with_7-Zip | zip all the files together]] and upload them as one file).

##* For each of the significant profiles, click on the "Profile GO Table" to see the list of Gene Ontology terms belonging to the profile. In the window that appears, click on the "Save Table" button and save the file to your desktop. Make your filename descriptive of the contents, e.g. "wt_profile#_GOlist.txt", where you use "wt", "dGLN3", etc. to indicate the dataset and where you replace the number symbol with the actual profile number. At this point you have saved all of the primary data from the STEM software and it's time to interpret the results!

##** Upload these files to [http://lionshare.lmu.edu LionShare] and e-mail a link to Dr. Dahlquist. (It will be easier to [[BIOL398-04/S15:Help#Compressing_Files_with_7-Zip | zip all the files together]] and upload them as one file).

# '''Analyzing and Interpreting STEM Results'''

## Select '''''one''''' of the profiles you saved in the previous step for further intepretation of the data. I suggest that you choose one that has a pattern of up- or down-regulated genes at the early (first three) timepoints. You and your partner will choose the '''''same''''' profile so that you can compare your results between the two strains. Answer the following:

##* '''''Why did you select this profile? In other words, why was it interesting to you?'''''

##* '''''How many genes belong to this profile?'''''

##* '''''How many genes were expected to belong to this profile?'''''

##* '''''What is the p value for the enrichment of genes in this profile?''''' Bear in mind that we just finished computing p values to determine whether each individual gene had a significant change in gene expression at each time point. This p value determines whether the number of genes that show this particular expression profile across the time points is significantly more than expected.

##* Open the GO list file you saved for this profile in Excel. This list shows all of the Gene Ontology terms that are associated with genes that fit this profile. Select the third row and then choose from the menu Data > Filter > Autofilter. Filter on the "p-value" column to show only GO terms that have a p value of < 0.05. '''''How many GO terms are associated with this profile at p < 0.05?''''' The GO list also has a column called "Corrected p-value". This correction is needed because the software has performed thousands of significance tests. Filter on the "Corrected p-value" column to show only GO terms that have a corrected p value of < 0.05. '''''How many GO terms are associated with this profile with a corrected p value < 0.05?'''''

##* Select 10 Gene Ontology terms from your filtered list (either p < 0.05 or corrected p < 0.05).

##** Since you and your partner are going to compare the results from each strain for the same cluster, you can either:

##*** Choose the same 10 terms that are in common between strains.

##*** Choose 10 terms that are different between the strains (5 or so from each).

##*** Choose some that are the same and some that are different.

##**'''''Look up the definitions for each of the terms at [http://geneontology.org http://geneontology.org]. For your final lab report, you will discuss the biological interpretation of these GO terms. In other words, why does the cell react to cold shock by changing the expression of genes associated with these GO terms? Also, what does this have to do with HAP4 being deleted?'''''

##** To easily look up the definitions, go to [http://geneontology.org http://geneontology.org].

##** Copy and paste the GO ID (e.g. GO:0044848) into the search field at the upper left of the page called "Search GO Data".

##** In the [http://amigo.geneontology.org/amigo/medial_search?q=GO%3A0044848 results] page, click on the button that says "Link to detailed information about <term>, in this case "biological phase"".

##** The definition will be on the next results page, e.g. [http://amigo.geneontology.org/amigo/term/GO:0044848 here].

== Step 10: YEASTRACT ==

=== Using YEASTRACT to Infer which Transcription Factors Regulate a Cluster of Genes ===

In the previous analysis using STEM, we found a number of gene expression profiles (aka clusters) which grouped genes based on similarity of gene expression changes over time. The implication is that these genes share the same expression pattern because they are regulated by the same (or the same set) of transcription factors. We will explore this using the YEASTRACT database.

# Open the gene list in Excel for the one of the significant profiles from your stem analysis. Choose a cluster with a clear cold shock/recovery up/down or down/up pattern. You should also choose one of the largest clusters.

#* Copy the list of gene IDs onto your clipboard.

# Launch a web browser and go to the [http://www.yeastract.com/ YEASTRACT database].

#* On the left panel of the window, click on the link to [http://www.yeastract.com/formrankbytf.php ''Rank by TF''].

#* Paste your list of genes from your cluster into the box labeled ''ORFs/Genes''.

#* Check the box for ''Check for all TFs''.

#* Accept the defaults for the Regulations Filter (Documented, DNA binding plus expression evidence)

#* Do '''''not''''' apply a filter for "Filter Documented Regulations by environmental condition".

#* Rank genes by TF using: The % of genes in the list and in YEASTRACT regulated by each TF.

#* Click the ''Search'' button.

# Answer the following questions:

#* In the results window that appears, the p values colored green are considered "significant", the ones colored yellow are considered "borderline significant" and the ones colored pink are considered "not significant". '''''How many transcription factors are green or "significant"?'''''

#* '''''List the "significant" transcription factors on your wiki page, along with the corresponding "% in user set", "% in YEASTRACT", and "p value".'''''

#** '''''Are CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1 on the list?'''''

# For the mathematical model that we will build, we need to define a ''gene regulatory network'' of transcription factors that regulate other transcription factors. We can use YEASTRACT to assist us with creating the network. We want to generate a network with approximately 15-30 transcription factors in it.

#* You need to select from this list of "significant" transcription factors, which ones you will use to run the model. You will use these transcription factors and add CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1 if they are not in your list. Explain in your electronic notebook how you decided on which transcription factors to include. Record the list and your justification in your electronic lab notebook.

#* Go back to the YEASTRACT database and follow the link to ''[http://www.yeastract.com/formgenerateregulationmatrix.php Generate Regulation Matrix]''.

#* Copy and paste the list of transcription factors you identified (plus CIN5, HAP4, GLN3, HMO1, SWI4, and ZAP1) into both the "Transcription factors" field and the "Target ORF/Genes" field.

#* We are going to generate several regulation matrices, with different "Regulations Filter" options.

#** For the first one, accept the defaults: "Documented", "DNA binding '''plus''' expression evidence"

#** Click the "Generate" button.

#** In the results window that appears, click on the link to the "Regulation matrix (Semicolon Separated Values (CSV) file)" that appears and save it to your Desktop. Rename this file with a meaningful name so that you can distinguish it from the other files you will generate.

#** Repeat these steps to generate a second regulation matrix, this time applying the Regulations Filter "Documented", "'''Only''' DNA binding evidence".

#** Repeat these steps a third time to generate a third regulation matrix, this time applying the Regulations Filter "Documented", DNA binding '''and''' expression evidence".

=== Visualizing Your Gene Regulatory Networks with GRNsight===

We will analyze the regulatory matrix files you generated above in Microsoft Excel and visualize them using GRNsight to determine which one will be appropriate to pursue further in the modeling.

# First we need to properly format the output files from YEASTRACT. You will repeat these steps for each of the three files you generated above.

#* Open the file in Excel. It will not open properly in Excel because a semicolon was used as the column delimiter instead of a comma. To fix this, Select the entire Column A. Then go to the "Data" tab and select "Text to columns". In the Wizard that appears, select "Delimited" and click "Next". In the next window, select "Semicolon", and click "Next". In the next window, leave the data format at "General", and click "Finish". This should now look like a table with the names of the transcription factors across the top and down the first column and all of the zeros and ones distributed throughout the rows and columns. This is called an "adjacency matrix." If there is a "1" in the cell, that means there is a connection between the trancription factor in that row with that column.

#* Save this file in Microsoft Excel workbook format (.xlsx).

#* Check to see that all of the transcription factors in the matrix are connected to at least one of the other transcription factors by making sure that there is at least one "1" in a row or column for that transcription factor. If a factor is not connected to any other factor, delete its row and column from the matrix. Make sure that you still have somewhere between 15 and 30 transcription factors in your network after this pruning.

#** Only delete the transcription factor if there are all zeros in its column '''AND''' all zeros in its row. You may find visualizing the matrix in GRNsight (below) can help you find these easily.

#* For this adjacency matrix to be usable in GRNmap (the modeling software) and GRNsight (the visualization software), we need to transpose the matrix. Insert a new worksheet into your Excel file and name it "network". Go back to the previous sheet and select the entire matrix and copy it. Go to you new worksheet and click on the A1 cell in the upper left. Select "Paste special" from the "Home" tab. In the window that appears, check the box for "Transpose". This will paste your data with the columns transposed to rows and vice versa. This is necessary because we want the transcription factors that are the "regulatORS" across the top and the "regulatEES" along the side.

#* The labels for the genes in the columns and rows need to match. Thus, delete the "p" from each of the gene names in the columns. Adjust the case of the labels to make them all upper case.

#* In cell A1, copy and paste the text "rows genes affected/cols genes controlling".

# Now we will visualize what these gene regulatory networks look like with the GRNsight software.

#* Go to the [http://dondi.github.io/GRNsight/ GRNsight] home page (you can either use the version on the home page or the [http://dondi.github.io/GRNsight/beta.html beta version].

#* Select the menu item File > Open and select one of the regulation matrix .xlsx file that has the "network" worksheet in it that you formatted above. If the file has been formatted properly, GRNsight should automatically create a graph of your network. Move the nodes (genes) around until you get a layout that you like and take a screenshot of the results. Paste it into your PowerPoint presentation. Repeat with the other two regulation matrix files. You will want to arrange the genes in the same order for each screenshot so that the graphs can be easily compared.

<!--

# Now we will look at some of the network properties. Again, repeat these steps for each of the three gene regulatory matrices you generated above. See [[Media:BIOL398-04_S15_Week12_sample_network.xlsx | this file]] for an example of how to do the following instructions.

#* Create a new worksheet and call it "degree". Copy and paste your adjacency matrix from the "network" sheet into this new worksheet.

#* In the first empty cell in column A, type "Out-degree". In the cell to the right of that in Column B, type the equation <code>=SUM(</code> and select the range of cells in column B that has 1's and 0's in it, close the parentheses, and press Enter. This quantity is the number of genes that the transcription factor in that column is controlling, or the out-degree. Copy and paste that equation across all of the columns.

#* In Cell 1 of the first empty column to the right of the adjacency matrix, type "In-degree". In Cell 2 of this column, type the equation <code>=SUM(</code> and select the entire row of 1's and 0's, close the parentheses, and press Enter. This quantity is the number of transcription factors that regulate the gene in that row, or the in-degree. Copy and paste the equation down the entire column, including the row that contains the out-degree sums.

#* The number in the lower right-hand corner, the sum of sums, is the total number of edges in the adjacency matrix. We would like to see about 50 (40-60 or so) edges in the matrix. If the matrix is too dense, it will slow down the modeling program because it will be difficult to estimate the parameters in the model.

#* We want to plot the degree distributions for each of your gene regulatory networks. In the "degree" worksheet, create three columns to the right called "Frequency", "In-degree total", and "Out-degree total". In the "Frequency" column, number sequentially from 1 to the largest degree number in your calculations above. In the "In-degree total" column, type the number of genes with that in-degree for each of the frequencies. In the "Out-degree total" column, type the number of genes with that out-degree for each of the frequencies.

#* Select the "Frequency", "In-degree total", and "Out-degree total" columns. Go to the "Insert" tab and select the column chart type to insert a plot of the degree distribution. Copy and paste the charts for each gene regulatory matrix into your PowerPoint presentation.

-->

<!--In the future, look at their networks to make sure that their TF of interest is being regulated by at least one other factor and regulates at least one factor. They may need to fiddle around with this to find a network that does this. Also, have them upload their Excel spreadsheets to the wiki, not just figures in PowerPoint.-->

== Step 11: GRNmap ==

=== Create the Input Excel Workbook for the Model ===

# Your file will be similar to the file "21-genes_50-edges_Dahlquist-data_Sigmoid_estimation.xls", but with your expression data and network. You should download this file, change the name, and edit it to include your data. Make sure to give it a meaningful filename that includes your last name or initials. [https://github.com/kdahlquist/GRNmap/blob/master/test_files/data_samples/21-genes_50-edges_Dahlquist-data_Sigmoid_estimation.xls?raw=true Click this link to download the sample file from the GRNmap GitHub repository.)]

# The first thing you need to do is determine the transcription factors that you are including in your network. You are going to use the "transposed" Regulation Matrix that you generated from YEASTRACT in the previous section.

#* Copy the transposed matrix from your "network" sheet and paste it into the worksheets called "network" and "network_weights".

#* Note that the transcription factor names have to be in the same order and same format across the top row and first column. CIN5 does not match Cin5p, so the latter will need to be changed to CIN5 if you have not already done so.

#* It may be easier for you if you put the transcription factors in alphabetical order (using the sort feature in Excel), but whether you leave your list the same as it is from the YEASTRACT assignment or in alphabetical order, make sure it is the same order for all of the worksheets.

# The next worksheet to edit is the one called "degradation_rates".

#* Paste your list of transcription factors from your "network" sheet into the column named "StandardName". You will need to look up the "SystematicName" of your genes. YEASTRACT has a feature that will allow you to paste your list of standard names in to retrieve the systematic names [http://www.yeastract.com/formorftogene.php here].

#* Next, you will need to look up the degradation rates for your list of transcription factors. These rates have been calculated from protein half-life data from a paper by Belle et al. (2006). Look up the rates for your transcription factors from [[Media:Belle_PNAS_06_degradation_rates_203_TFs.xls | this file]] and include them in your "degradation_rates" worksheet.

#* If a transcription factor does not appear in the file above, use the value "0.027182242" for the degradation rate.

# The next worksheet to edit is the one called "production_rates".

#* Paste the "SystematicName" and "StandardName" columns from your "degradation_rates" sheet into the "production_rates" sheet.

#* The initial guesses for the production rates we are using for the model are two times the degradation rate. Compute these values from your degradation rates and paste the values into the column titled "ProductionRate".

# Next you will input the expression data for the wild type strain and one other strain (dcin5, dgln3, dhap4, dhmo1, dzap1, or spar; note that we can't use dswi4 because it only has 2 cold shock timepoints). You need to include only the data for the genes in your network, in the same order as they appear in the other worksheets.

#* Put the wild type data in the sheet called "wt".

#* The sample spreadsheet has a worksheet named "dcin5". Change this name to match the strain you are using (listed above). The instructions below should be followed for each strain sheet.

#* Paste the SystematicName and StandardName columns from one of your previous sheets into this one.

#* This data in this sheet is the Log Fold Changes for each replicate and each timepoint from the "Rounded_Normalized_Data" worksheet from the big Excel workbook in which you computed the [[Dahlquist:Microarray_Data_Analysis_Workflow#Step_6:_Statistical_Analysis | statistics]]. We are only going to use the cold shock timepoints for the modeling. Thus your column headings for the data should be "15", "30", and "60". There will be multiple columns for each timepoint (typically 4) to represent the replicate data, but they will all have the same name. For example, you may have four columns with the header "15".

#* Copy and paste the data from your spreadsheet into this one. You need to include only the data for the genes in your network. Make sure that the genes are in the same order as in the other sheets.

# The "optimization_parameters" worksheet should have the following values:

#* alpha should be 0.01

#* kk_max should be 1

#* MaxIter should be 1e08 (one hundred million in plain English)

#* TolFun should be 1e-6

#* MaxFunEval should be 1e08 (one hundred million in plain English)

#* TolX should be 1e-6

#* Sigmoid should be 1

#* estimateParams should be 1

#* makeGraphs should be 1

#* fix_P should be 0

#* fix_b should be 1

#* For the parameter "time" (Cell A13), we should have "15", "30", and "60", since these are the timepoints we have in our data.

#* For the parameter "Strain" (Cell A14), replace "dcin5" with the name of the second strain you are using, making sure that the capitalizaiton and spelling is the same as what you named the worksheet containing that strain's expression data. We are only going to compare two strains, so you can delete the other strain information.

#* For the parameter "Sheet" (Cell A15), give the number of the worksheet from left to right that your "Strain" log2 expression data is in. Delete any extra numbers because we are only comparing two strains.

# For the parameter "Deletion", leave the zero in cell B15 (corresponding to wt). In cell C15, put a number corresponding to the position in the list of gene names that the gene that was deleted appears. In the sample file, CIN5 is number 3 in the list. Note, disregard the column header in this count and only consider the actual gene names themselves.

#* For the parameter, "simtime", you perform the forward simulation of the expression in five minute increments from 0 to 60 minutes. Thus, this row should read: simtime should be 0, 5, <...fill by steps of 5...>, 60, each number in a different cell.

# The last sheet you will need to modify is called "network_b".

#* Paste in the list of standard names for your transcription factors from one of your previous sheets. Note that this sheet does not have a column for the Systematic Name.

#* Cell A1 in the sample files has the text "rows genes affected/cols genes controlling". I believe you can either have this text in cell A1 or "StandardName".

#* The "threshold" value for each gene should be "0".

# When you have completed the modifications to your file, upload it to [http://lionshare.lmu.edu LionShare] and send Dr. Dahlquist an e-mail with a link to the file.

==== Appendix: Full explanation of the "optimization_parameters" sheet ====

* <code>alpha</code>: Penalty term weighting (from an L-curve analysis)

* <code>kk_max</code>: Number of times to re-run the optimization loop: in some cases re-starting the optimization loop can improve performance of the estimation.

* <code>MaxIter</code>: Number of times MATLAB iterates through the optimization scheme. If this is set too low, MATLAB will stop before the parameters are optimized.

* <code>TolFun</code>: How different two least squares evaluations should be before it says it's not making any improvement

* <code>MaxFunEval</code>: maximum number of times it will evaluate the least squares cost

* <code>TolX</code>: How close successive least squares cost evaluations should be before MATLAB determines that it is not making any improvement.

* <code>Sigmoid</code>: <code>=1</code> if sigmoidal model, <code>=0</code> if Michaelis-Menten model

* <code>estimateParams</code>: <code>=1</code> if want to estimate parameters and <code>=0</code> if the user wants to do just one forward run

* <code>makeGraphs</code>: <code>=1</code> to output graphs; <code>=0</code> to not output graphs

* <code>fix_P</code>: <code>=1</code> if the user does not want to estimate the production rate, P, parameter, use initial guess and never change; <code>=0</code> to estimate

* <code>fix_b</code>: <code>=1</code> if the user does not want to estimate the b parameter, use initial guess and never change; <code>=0</code> to estimate

* <code>time</code>: A row containing a list of the time points when the data was collected experimentally. Should correspond to the timepoint column headers in the expression sheets.

* <code>Strain</code>: A row containing a list of all of the strains for which there is expression data in the workbook. Should correspond to the names of the sheets for each strain.

* <code>Sheet</code>: A row where each entry is the order number of the sheet (left to right) that corresponds to the list of strains above.

* <code>Deletion</code>: Gives the index of the gene in the network sheet that has been deleted in each strain listed above. For example, if data has been provided for the CIN5 deletion strain, then give the index number from the network sheet corresponding to CIN5.

* <code>simtime</code>: A list of times for which the forward simulation should be evaluated.

=== Running GRNmap ===

You will now finally run the GRNmap model on the input workbook you created above. You will run the optimization twice; once where the threshold parameters, b, are '''not''' estimated and once where the threshold parameters ''''are''' estimated. You will compare the estimated weight and production rate parameters outputted by these two runs with each other.

# Download the current version of GRNmap from GitHub. Version 1.0.6 can be downloaded by following this [https://github.com/kdahlquist/GRNmap/archive/v1.0.6.zip link].

#* For the sake of organization, save it into a new folder called "GRNmap" either on your Desktop or within your "Microarray Analysis" folder.

#* Unzip the file by right-clicking on it and choosing 7-zip > Extract here.

# Open the "GRNmap-1.0.6" folder and open the "matlab" subfolder. Double-click on the file "GRNmodel.m" to open GRNmap in MATLAB 2014b.

# Click on the green triangle "Run" button to run the model.

#* You will be prompted by an Open dialog to find your input file that you created in the previous section. Browse and select this input file and click OK.

#* Note that the Open dialog will default to show files of <code>*.xlsx</code> only. If your file is saved as <code>*.xls</code>, you will need to select the drop-down menu to show all files.

#* A window called "Figure 1" will appear. The counter is showing the number of iterations of the least squares optimization algorithm. The top plot is showing the values of all the parameters being estimated. You should see some movement of the diamonds each time the counter iterates.

# Once the model has completed its run, plots showing the expression over time for all of the genes in the network will appear. Since we selected "makeGraphs = 1" these will automatically be saved as <code>*.jpg</code> files in the same folder as your input file. Compile the figures into a single PowerPoint file. Please label things clearly, placing an appropriate number of graphs on each page for a readable visual. Take some care to make sure that the graphs are the same size and the aspect ratio has not been changed. <!--maybe suggest to put graphs for the same gene side by side-->

# Create a new workbook for analyzing the weight data. In this workbook, create a new sheet: call it estimated_weights. In this new worksheet, create a column of labels of the form ControllerGeneA -> TargetGeneB, replacing these generic names with the standard gene names for each regulatory pair in your network. Remember that columns represent Controllers and rows represent Targets in your network and network_weights sheets.

# Extract the non-zero optimized weights from their worksheet and put them in a single column next to the corresponding ControllerGeneA -> TargetGeneB label.

# Now we will run the model a second time, this time estimating the threshold parameters, b. Save the input workbook that you previously created as a new file with a meaningful name (e.g. append "estimate-b" to the previous filename), and change fix_b to 0 in the "optimization_parameters" worksheet, so that the thresholds will be estimated. Rerun GRNmodel with the new input sheet.

# Repeat Parts (4) through (6) with the new output.

# Create an empty excel workbook, and copy both sets of weights into a worksheet.

# Create a bar chart in order to compare the "fixed b" and "estimated b" weights.

# Create bar charts to compare the production rates from each run.

# Copy the two bar charts into your powerpoint.

# Visualize the output of each of your model runs with GRNsight.

#* In order for this to work, you need to alter your output workbook slightly. You need to change the name of the sheet called "out_network_optimized_weights" to "network_optimized_weights"; i.e., delete the "out_" from that sheet name.

#* Arrange the genes in the same order you used to display them previously when you visualized the networks from YEASTRACT for both of your model output runs. Take a screenshot of each of the results and paste it into your PowerPoint presentation. Clearly label which screenshot belongs to which run.

#* Note that GRNsight will display differently now that you have estimated the weights. For positive weights > 0, the edge will be given a regular (pointy) arrowhead to indicate an activation relationship between the two nodes. For negative weights < 0, the edge will be given a blunt arrowhead (a line segment perpendicular to the edge direction) to indicate a repression relationship between the two nodes. The thickness of the edge will vary based on the magnitude of the absolute value of the weight. Larger magnitudes will have thicker edges and smaller magnitudes will have thinner edges. The way that GRNsight determines the edge thickness is as follows. GRNsight divides all weight values by the absolute value of the maximum weight in the matrix to normalize all the values to between zero and 1. GRNsight then adjusts the thickness of the lines to vary continuously from the minimum thickness (for normalized weights near zero) to maximum thickness (normalized weights of 1). The color of the edge also imparts information about the regulatory relationship. Edges with positive normalized weight values from 0.05 to 1 are colored magenta; edges with negative normalized weight values from -0.05 to -1 are colored cyan. Edges with normalized weight values between -0.05 and 0.05 are colored grey to emphasize that their normalized magnitude is near zero and that they have a weak influence on the target gene.

# Upload your PowerPoint, your two input workbooks, and your two output workbooks and link to them in your individual journal. Also upload the workbook where you made the bar charts comparing the weights from both runs.

#* Interpret the results of the model simulation.

#** Examine the graphs that were output by each of the runs. Which genes in the model have the closest fit between the model data and actual data? Which genes have the worst fit between the model and actual data? Why do you think that is? (Hint: how many inputs do these genes have?) How does this help you to interpret the microarray data?

#** Which genes showed the largest dynamics over the timecourse? In other words, which genes had a log fold change that is different than zero at one or more timepoints. The p values from the [[BIOL398-04/S15:Week 11 | Week 11]] ANOVA analysis are informative here. Does this seem to have an effect on the goodness of fit (see question above)?

#** Which genes showed differences in dynamics between the wild type and the other strain your group is using? Does the model adequately capture these differences? Given the connections in your network (see the visualization in GRNsight), does this make sense? Why or why not?

#** Examine the bar charts comparing the weights and production rates between the two runs. Were there any major differences between the two runs? Why do you think that was? Given the connections in your network (see the visualization in GRNsight), does this make sense? Why or why not?

#** Finally, based on the results of your entire project, which transcription factors are most likely to regulate the cold shock response and why?

#* Based on these results, what future directions do you want to take?

Show more