RE: TSE-0331-1205, "Reliable Effects Screening: A Distributed Continuous Quality Assurance Process for Monitoring Performance Degradation in Evolving Software Systems"
Manuscript Type: Regular

Dear Dr. Porter,

We have completed the review process of the above-referenced paper, which was submitted to the IEEE Transactions on Software Engineering for possible publication. Your reviews are enclosed. Based on these reports, the Associate Editor, Dr. Littlewood, has recommended to the Editor-in-Chief that your paper undergo a major revision. We suggest that you revise your paper according to the reviewers' comments and resubmit it for a second round of reviews. If you wish to revise the paper, please do so by <<4 July 2006>>.

The reviewers have several questions and suggestions. Dr. Littlewood would like you to address all the questions raised by the reviewers in your revision.

***********
Editor Comments

The reviewers generally liked your paper, as I did myself. Even reviewer 3, who was most critical, thought there was a very good paper struggling to get out of this one! I am recommending 'major revision' largely because I agree with the views of reviewer 3, and I hope that you will be able to respond to his extensive suggestions (see his .pdf file). I should say that this reviewer is a statistician who has great experience in dependability and software engineering. He is someone for whom I have the greatest respect. Since your revision will mainly be addressing the criticisms of reviewer 3, I would be happy to speed up the re-review process by only asking for views from him and not from the others (although, of course, you will have to satisfy me!).

**************************
Reviewer Comments
**************************
Reviewer 1

Section III. Detailed Comments
A. Public Comments (these will be made available to the author)

The paper extends the authors' previous work regarding "main effects screening" - a distributed continuous quality assurance (DCQA) process for assessing and improving the performance of evolving systems that have a large number of configuration options. They propose a new process called "reliable effects screening", which includes verifying key assumptions made during the screening process without too much effort. The topic is very relevant to IEEE TSE, and the paper is well written and most parts are easy to read.

Comments:

The configuration options that have been used are all binary. It would be useful to explain, in a paragraph, how some of the key equations would change if, say, options that can take four values were used. Specifically, what changes would happen to Equations 1-4? If possible, the authors should give an appropriate reference in Section IIIC to a previous study that uses non-binary options for similar experiments.
RESP: Designs for non-binary factors exist, but the details of how to compute effects differ considerably across different specific designs. We have inserted a reference indicating where readers can get further details.
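As an illustrative aside (a minimal sketch with invented option names and latency values, not the paper's Equations 1-4 or its data), the snippet below shows how a main effect is typically computed for a binary option as the difference between two level means, and why a four-level option no longer reduces to a single difference:

import numpy as np

# Invented latency observations, for illustration only.
binary_level = np.array([-1, -1, +1, +1, -1, +1, -1, +1])  # a two-level option
four_level   = np.array([0, 1, 2, 3, 0, 1, 2, 3])          # a hypothetical four-level option
latency      = np.array([12.0, 11.5, 14.2, 13.8, 12.3, 14.0, 11.9, 13.6])

# Two-level main effect: mean response at the high setting minus
# mean response at the low setting.
effect = latency[binary_level == +1].mean() - latency[binary_level == -1].mean()
print(f"main effect of the binary option: {effect:.2f}")

# With four levels there is no single high/low difference; one common
# generalization compares the mean response at each level (or defines
# orthogonal contrasts among the four level means).
level_means = [latency[four_level == lvl].mean() for lvl in range(4)]
print("per-level mean latencies of the four-level option:", np.round(level_means, 2))

The exact contrasts used would depend on the specific multi-level design chosen, which is the point made in the response above.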
On page 28, 3rd paragraph: Kindly mention how you arrived at exactly "additional 2328 configurations" for benchmarking.
RESP: We now explain in the text that this number is derived automatically via a search-based design algorithm.

Page 31, Figure 11: If possible, please explain why the last point is an outlier in both the top-5 and top-2 cases.
RESP: That point shows that the maximum of the actual data is higher than the maximum of the estimate. There are many possible explanations for it. We see no clear explanation; in particular, there is no systematic higher-level effect causing this. Keeping in mind that higher latency is bad, we suspect that this is simply an outlier in the data, perhaps from a heavily loaded machine or temporary network congestion. In any event, we would prefer not to speculate on an explanation in the paper.

Figure 3 is slightly confusing to look at and does not immediately convey what is being explained. I believe that it is a tool screenshot (and so cannot be changed) - kindly mention in the paper that it is a screenshot.
RESP: TBD

Typos and minor corrections:

Page 35: Reliable effects screening: .... to be "rapid", cheap and ....
RESP: Fixed

Page 29, 3rd line: "99.99%"
RESP: Fixed

Page 33: Please mention the units of latency on the y-axis (i.e., ms, secs, etc.)
RESP: TBD

References: In [31], "K.S.T. Vibhu ..." is not properly cited. Please correct the names and the ordering of authors as given at: http://dx.doi.org/10.1007/11424529_5
RESP: Fixed

Please consistently use 'In Proceedings of' in all the references to papers in conference proceedings, rather than using 'in Proc.', etc.
RESP: Fixed

Please be consistent in naming the authors: please mention all authors in [14] (and not "et.al").
RESP: Fixed

**************************
Reviewer 2

Section III. Detailed Comments
A. Public Comments (these will be made available to the author)

Very thorough, substantially improved from the conference paper. The example discussed in Section V, "using changes in option importance to detect bugs", is intriguing. It would be nice if the authors could produce some numbers to back up the suggestion that reliable effects screening would produce a "dramatic change". Could the numbers be run and reported? Not essential, but it would make the applications discussion stronger.
RESP: We report that option B's effect would have been 2.5 times greater after the change than it was before it.

**************************
Reviewer 3

Section III. Detailed Comments
A. Public Comments (these will be made available to the author)

General Comments

This paper contains some interesting and important developments which I think should be reported. However, it is far too long and convoluted, and I just got bored stiff reading it. I'll address the general issue of structure and then move on to some detailed comments.

Structure

The paper consists of a number of elements: a description of the group's approach to software design, the analytical and experimental treatment of the candidate designs, and their conclusions. The approach to software design and the arguments supporting the analytical and experimental work deserve to be published. Unfortunately, these have been mixed in with a rather poor exposition of the use of designed experiments. A very large part of the paper can be referenced away. It is also striking that, after the problem statement, they give a description of something that is already widely available in proprietary software. Lastly, as an applied statistician rather than a software engineer, I find the description of the problem excessively complicated and full of grandiose language when what they are doing is, at least conceptually, simple.

Problem definition

I should like the authors to say something similar to the following in clear plain English.
- They are developing a system to design and test software systems.
- The software systems are largely Distributed Object Computing systems.
- The design phase uses, conceptually:
  o a number of sockets into which software elements can be fitted (a modular approach);
  o the system has to meet a number of performance requirements, largely QoS;
  o the software elements are standard components, but their performance can be adjusted through controllable parameters;
  o there are environmental and interaction effects which are difficult to predict.
- The objective is to automate the design by developing a system to select the software elements and to set the appropriate parameter values.
- The automation is necessary to allow the designers to cope with the very large number of combinations within a single design (there is a combinatorial explosion).

Design and analysis

This is the SKOLL part. The key points are:
- the system produces a design which may be a restricted version of the final product to allow effective testing;
- the system produces the testing environment which can exercise the product.

Experimentation

This part is largely a rather poor and heavy-handed exposition of experimental design. It should be dispensed with and referenced out. Further, it is a re-invention of the wheel, and the references suggest an almost perverse ignorance of the application of experimental design and statistical techniques in other industries. They seem to borrow from, without referencing:
- six-sigma approaches
- "the seven old tools"
- Exploratory Data Analysis

There is much ready-made software to implement these parts of their process, software that has been widely used and thoroughly proven: MINITAB, for example, does everything they describe and provides rather more help to the user. There is also a very good system for statistical analysis available through the National Institute of Standards and Technology (http://www.nist.gov/) and Sematech (http://www.sematech.org/). That system is the Engineering Statistics Handbook (http://www.itl.nist.gov/div898/handbook/index.htm).

What they need to keep in the article is the value of screening experiments and the choice of the resolution of a design. The details of fractional designs and D-optimal designs can be dropped. If they do this, they can explain more clearly the value and limitations of their approach. In particular, screening designs are all two-level designs and so miss the possibility of non-linearities in responses; they are also poor on interactions (their discussion of aliasing and resolution). They do not need to say much about the computational simplification that follows from using orthogonal designs; we'll assume that there will be software available to do this drudgery (MINITAB, for example).

It is also striking that they have not mentioned robust design, something which is widely used in other industrial sectors. Robust design would seem to be highly relevant in making software systems proof against environmental variation.

Lastly, while discussing the use of experimental design: it really isn't very useful when the number of factors in the experiment is large. It is well known (look at any book by Doug Montgomery) that while DOE experiments may produce feasible designs, even robust designs, they cannot distinguish well between the large number of alternatives that meet the product requirements (an identifiability problem). They should look at the lecture by Richard Parry-Jones, Engineering for Corporate Success in the New Millennium (http://www.raeng.org.uk/news/publications/).
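To make the reviewer's point about aliasing and resolution concrete, the following is a minimal sketch (a standard textbook construction, not taken from the paper): it builds the half fraction of a 2^3 design with generator C = AB and checks that the C column coincides with the AB interaction column, so the two effects cannot be estimated separately.

import numpy as np
from itertools import product

# Full 2^3 factorial in coded units (-1, +1).
full = np.array(list(product([-1, 1], repeat=3)))
A, B, C = full[:, 0], full[:, 1], full[:, 2]

# Half fraction defined by the generator C = A*B (defining relation I = ABC).
half = full[C == A * B]
a, b, c = half[:, 0], half[:, 1], half[:, 2]

# In the half fraction the column for C is identical to the A*B column,
# so the main effect of C is aliased with the AB interaction; this is
# what makes the fraction resolution III.
print(np.array_equal(c, a * b))  # True

Higher-resolution fractions push such aliasing onto higher-order interactions, which is exactly the trade-off a screening design makes to stay cheap.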
Recommendation

I think there is a really interesting and useful piece of work struggling to escape from this article, and the authors should be encouraged to resubmit a heavily revised version which addresses the issues I have raised. They need to:
- cut out all the statistical exposition;
- clarify the nature of the problem addressed, as outlined above;
- relate their approach to other uses of experimental methods in engineering design.

References

I have mentioned exploratory analysis, the seven old tools, and Exploratory Data Analysis. They can find an excellent exposition of all of these things in the book below. Since this book covers all the statistics mentioned in the article, there does not seem to be much of a case for leaving them in.

Quality: Systems, Concepts, Strategies and Tools, W.J. Kolarik, McGraw-Hill Education, March 1995, ISBN: 0070352178

**************************