

This file describes the data sets that are distributed with SciCraft.



File name                  Format                  Type
------------------------------------------------------------------------------------
ampicillin.mat             Matlab bin              Spectral data, regression
mixture_3_comp.mat         Matlab bin              Spectral data, regression
octane_prep.mat            Matlab bin              Spectral data, regression
spellman.mat               Matlab bin              Microarray, regression/classification
wheat_spectra.mat          Matlab bin              Spectral data, regression
ovarian.RData              R bin                   Microarray, classification

-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------




More detailed description for each data set:

-------------------------------------------------------------------------------------

File name: 

AMPICILLIN.MAT

File objects:

  CalX      20x882                 141120  double array
  Caly      20x1                      160  double array
  ValX      20x882                 141120  double array
  Valy      20x1                      160  double array

Description:

This data set consists of 40 diffuse-reflectance FT-IR spectra of
mixtures of the bacterium Staphylococcus aureus with the antibiotic
ampicillin added at different concentrations (0.5 mM to 20 mM in steps
of 0.5 mM). Infrared spectra (256 coadds) for each of these samples
were recorded over the wavenumber interval 4000 cm-1 to 600 cm-1 using
a Bruker IFS28 FT-IR spectrometer (Bruker Spectrospin Ltd., Banner
Lane, Coventry CV4 9GH, U.K.) equipped with a liquid N2-cooled MCT
(mercury-cadmium-telluride) detector and a diffuse-reflectance
absorbance TLC accessory. The wavenumber resolution was 4.0 cm-1, and
spectra were collected at 20 s-1. The digitisation interval of the IR
instrument was set to produce 882 data points per spectrum. The
spectra are split into a calibration set (CalX, Caly) and a validation
set (ValX, Valy) of 20 samples each.

Source: Dr. Roy Goodacre, UMIST, UK
-------------------------------------------------------------------------------------



-------------------------------------------------------------------------------------

File name: 

MIXTURE_3_COMP.MAT

File objects:


  CalX       80x882                 564480  double array
  ValX       80x882                 564480  double array
  ans         1x479                    958  char array
  ycalx      80x3                     1920  double array
  yvalx      80x3                     1920  double array

Description:

This data set consists of 160 FT-IR spectra of mixtures of the three
compounds histidine, glycine and sucrose at different concentrations.
The span of the 27 different concentration distributions of each
compound is shown in Table 1. Six replicate 5 ml aliquots of 27
samples consisting of different combinations of histidine (100 mM),
glycine (300 mM) and sucrose (100 mM) solutions were dried into wells
in a sandblasted aluminium plate. Infrared spectra were collected and
the data processed as described for the ampicillin data set above, but
using 16 coadds. Initially there were 6 replicates of each
concentration distribution, but 12 of the glycine replicates were
found to be outliers and were removed from the data set.

Source: Dr. Roy Goodacre, UMIST, UK
-------------------------------------------------------------------------------------



-------------------------------------------------------------------------------------

File name:

OCTANE_PREP.MAT

File objects:

  Xcal0         26x226                  47008  double array
  Xval0         13x226                  23504  double array
  calnames      26x16                     832  char array
  valnames      13x16                     416  char array
  ycal          26x1                      208  double array
  yval          13x1                      104  double array

Description:

Near-infrared (NIR) data set of gasoline samples with different
octane numbers. The aim is to predict the octane number from the NIR
spectra using PLSR (partial least squares regression) or another
regression method.
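
The regression task above can be sketched as follows. This is only an
illustration under stated assumptions, not the method used in
SciCraft: it runs a plain NIPALS PLS1 on synthetic, noise-free spectra
that merely have the same sizes as Xcal0/Xval0 (26x226 and 13x226).
All variable names ending in _demo, the two Gaussian bands and the
octane values are hypothetical. To try it on the real data, load
octane_prep.mat first and substitute Xcal0, ycal, Xval0 and yval.

```octave
% PLS1 (NIPALS) sketch on synthetic stand-ins for octane_prep.mat.
% All *_demo data below are made up for illustration only.
n_cal = 26; n_val = 13; n_wl = 226;      % sizes from the listing above
wl    = linspace(0, 1, n_wl);            % hypothetical wavelength axis
band1 = exp(-((wl - 0.3) / 0.05) .^ 2);  % nuisance band
band2 = exp(-((wl - 0.7) / 0.08) .^ 2);  % band that scales with octane

ycal_demo = linspace(83, 93, n_cal)';    % hypothetical octane numbers
yval_demo = linspace(84, 92, n_val)';
Xcal_demo = (1 + 0.1 * sin((1:n_cal)')) * band1 + (ycal_demo / 100) * band2;
Xval_demo = (1 + 0.1 * cos((1:n_val)')) * band1 + (yval_demo / 100) * band2;

ncomp = 2;                               % number of PLS components
xm = mean(Xcal_demo, 1);                 % column means of X
ym = mean(ycal_demo);
E  = Xcal_demo - repmat(xm, n_cal, 1);   % centred X (deflated in the loop)
f  = ycal_demo - ym;                     % centred y (deflated in the loop)
W  = zeros(n_wl, ncomp);
P  = zeros(n_wl, ncomp);
q  = zeros(ncomp, 1);
for a = 1:ncomp
  w = E' * f;
  w = w / norm(w);                       % weight vector
  t = E * w;                             % scores
  tt = t' * t;
  p = E' * t / tt;                       % X loadings
  q(a) = f' * t / tt;                    % y loading
  E = E - t * p';                        % deflate X
  f = f - t * q(a);                      % deflate y
  W(:, a) = w;
  P(:, a) = p;
end
B  = W * ((P' * W) \ q);                 % regression coefficients
b0 = ym - xm * B;                        % intercept

ypred_demo = Xval_demo * B + b0;
rmsep = sqrt(mean((yval_demo - ypred_demo) .^ 2));
fprintf('RMSEP on synthetic validation data: %g\n', rmsep);
```

Because the synthetic spectra contain no noise and only two sources of
variation, two components should reproduce the validation values
almost exactly. On real spectra the number of components would
normally be chosen by validation.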


-------------------------------------------------------------------------------------



-------------------------------------------------------------------------------------

File name:

SPELLMAN.MAT

File objects:

  Gene      24x3452                662784  double array
  Time      24x1                      192  double array


Description:

Microarray data set of different time steps in a cell cycle.
-------------------------------------------------------------------------------------



-------------------------------------------------------------------------------------

File name:

OVARIAN.OCT

File objects:

  CalClass      1x1           27   string
  CalX         27x1536    331776   matrix
  Caly         27x1          216   matrix
  ValClass      1x1           27   string
  ValX         27x1536    331776   matrix
  Valy         27x1          216   matrix
  X            54x1536    663552   matrix
  class        54x1          432   matrix
  y             1x1           54   string

Description:

Microarray data set of ovarian tumors. Here we have created a
calibration and a validation data set from the original data set of 54
samples. The y data were created as follows (CalX, ValX, CalClass and
ValClass were created in the same way):

Caly = class(1:2:54);
Valy = class(2:2:54);

CalClass and ValClass are the corresponding vectors as strings (0 =
'N' for "normal" and 1 = 'C' for "cancer").
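
The interleaved split described above can be sketched as follows. This
is a minimal illustration on made-up data, assuming that the same
odd/even row indexing was applied to X as to class. The names
class_demo and X_demo and their contents are hypothetical stand-ins
(X_demo has 4 columns instead of 1536 to keep the example small).

```octave
% Odd rows go to the calibration set, even rows to the validation set.
% class_demo and X_demo are made-up stand-ins for 'class' and 'X'.
n = 54;
class_demo = double(mod((1:n)', 3) == 0);   % 54x1, 0 = normal, 1 = cancer
X_demo     = reshape(1:n * 4, n, 4);        % 54x4 instead of 54x1536

Caly_demo = class_demo(1:2:n);              % rows 1, 3, 5, ..., 53
Valy_demo = class_demo(2:2:n);              % rows 2, 4, 6, ..., 54
CalX_demo = X_demo(1:2:n, :);
ValX_demo = X_demo(2:2:n, :);

% String versions of the labels: 0 -> 'N', 1 -> 'C'.
labels = 'NC';
CalClass_demo = labels(Caly_demo' + 1);     % 1x27 char
ValClass_demo = labels(Valy_demo' + 1);

disp(size(CalX_demo));
disp(CalClass_demo);
```

Each half holds 27 samples, and together the two halves contain every
row of the original data exactly once.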



-------------------------------------------------------------------------------------

File name:

OVARIAN.RDATA

File objects:

  Gene      54x1536               
  Label     1x54                  

Description:

Microarray data set of ovarian tumors
-------------------------------------------------------------------------------------

