RE: TSE-0331-1205, "Reliable Effects Screening: A Distributed Continuous Quality Assurance Process for Monitoring Performance Degradation in Evolving Software Systems"
Manuscript Type: Regular

Dear Dr. Porter,

We have completed the review process of the above-referenced paper, which was submitted to the IEEE Transactions on Software Engineering for possible publication. Your reviews are enclosed. Based on these reports, the Associate Editor, Dr. Littlewood, has recommended to the Editor-in-Chief that your paper undergo a major revision. We suggest that you revise your paper according to the reviewers' comments and resubmit it for a second round of reviews. If you wish to revise the paper, please do so by <<4 July 2006>>.

The reviewers have several questions and suggestions. Dr. Littlewood would like you to address all the questions raised by the reviewers in your revision.

***********
Editor Comments

The reviewers generally liked your paper, as I did myself. Even reviewer 3, who was most critical, thought there was a very good paper struggling to get out of this one! I am recommending 'major revision' largely because I agree with the views of reviewer 3, and I hope that you will be able to respond to his extensive suggestions (see his .pdf file). I should say that this reviewer is a statistician who has great experience in dependability and software engineering. He is someone for whom I have the greatest respect. Since your revision will mainly be addressing the criticisms of reviewer 3, I would be happy to speed up the re-review process by only asking for views from him and not from the others (although, of course, you will have to satisfy me!).

**************************
Reviewer Comments
**************************
Reviewer 1

Section III. Detailed Comments
A. Public Comments (these will be made available to the author)

The paper extends the authors' previous work regarding "main effects screening" - a distributed continuous quality assurance (DCQA) process for assessing and improving the performance of evolving systems that have a large number of configuration options. They propose a new process called "reliable effects screening", which includes verifying key assumptions made during the screening process without too much effort. The topic is very relevant to IEEE TSE, and the paper is well written and most parts are easy to read.

Comments:

The configuration options that have been used are all binary. It would be useful to explain, in a paragraph, how some of the key equations would change if, say, options that can take four values were used. Specifically, what changes would happen to Equations 1-4? If possible, the authors should give an appropriate reference in Section IIIC to a previous study that uses non-binary options for similar experiments.
RESP: Designs for non-binary factors exist, but the details of how to compute effects differ considerably across different specific designs. We have inserted a reference indicating where readers can get further details.
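As an illustrative aside (a minimal sketch with invented option names and latency values, not the paper's Equations 1-4 or its data), the snippet below shows how a main effect is typically computed for a binary option as the difference between two level means, and why a four-level option no longer reduces to a single difference:

import numpy as np

# Invented latency observations, for illustration only.
binary_level = np.array([-1, -1, +1, +1, -1, +1, -1, +1])  # a two-level option
four_level   = np.array([0, 1, 2, 3, 0, 1, 2, 3])          # a hypothetical four-level option
latency      = np.array([12.0, 11.5, 14.2, 13.8, 12.3, 14.0, 11.9, 13.6])

# Two-level main effect: mean response at the high setting minus
# mean response at the low setting.
effect = latency[binary_level == +1].mean() - latency[binary_level == -1].mean()
print(f"main effect of the binary option: {effect:.2f}")

# With four levels there is no single high/low difference; one common
# generalization compares the mean response at each level (or defines
# orthogonal contrasts among the four level means).
level_means = [latency[four_level == lvl].mean() for lvl in range(4)]
print("per-level mean latencies of the four-level option:", np.round(level_means, 2))

The exact contrasts used would depend on the specific multi-level design chosen, which is the point made in the response above.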
On page 28, 3rd paragraph: Kindly mention how you arrived at exactly "additional 2328 configurations" for benchmarking.
RESP: We now explain in the text that this number is derived automatically via a search-based design algorithm.

Page 31, Figure 11: If possible, please explain why the last point is an outlier in both the top-5 and top-2 cases.
RESP: That point shows that the maximum of the actual data is higher than the maximum of the estimate. There are many possible explanations for it. We see no clear explanation; in particular, there is no systematic higher-level effect causing this. Keeping in mind that higher latency is bad, we suspect that this is simply an outlier in the data, perhaps from a heavily loaded machine or temporary network congestion. In any event, we would prefer not to speculate on an explanation in the paper.

Figure 3 is slightly confusing to look at and does not immediately convey what is being explained. I believe that it is a tool screenshot (and so cannot be changed) - kindly mention in the paper that it is a screenshot.
RESP: TBD

Typos and minor corrections:

Page 35: Reliable effects screening: .... to be "rapid", cheap and ....
RESP: Fixed

Page 29, 3rd line: "99.99%"
RESP: Fixed

Page 33: Please mention the units of latency on the y-axis (i.e., ms, secs, etc.)
RESP: TBD

References: In [31], "K.S.T. Vibhu ..." is not properly cited. Please correct the names and the ordering of authors as given at: http://dx.doi.org/10.1007/11424529_5
RESP: Fixed

Please consistently use 'In Proceedings of' in all the references to papers in conference proceedings, rather than using 'in Proc.', etc.
RESP: Fixed

Please be consistent in naming the authors: please mention all authors in [14] (and not "et.al").
RESP: Fixed

**************************
Reviewer 2

Section III. Detailed Comments
A. Public Comments (these will be made available to the author)

Very thorough, substantially improved from the conference paper. The example discussed in Section V, "using changes in option importance to detect bugs", is intriguing. It would be nice if the authors could produce some numbers to back up the suggestion that reliable effects screening would produce a "dramatic change". Could the numbers be run and reported? Not essential, but it would make the applications discussion stronger.
RESP: We report that option B's effect would have been 2.5 times greater after the change than it was before it.

**************************
Reviewer 3

Section III. Detailed Comments
A. Public Comments (these will be made available to the author)

General Comments

This paper contains some interesting and important developments which I think should be reported. However, it is far too long and convoluted, and I just got bored stiff reading it. I'll address the general issue of structure and then move on to some detailed comments.

Structure

The paper consists of a number of elements: a description of the group's approach to software design, the analytical and experimental treatment of the candidate designs, and their conclusions. The approach to software design and the arguments supporting the analytical and experimental work deserve to be published. Unfortunately, these have been mixed in with a rather poor exposition of the use of designed experiments. A very large part of the paper can be referenced away. It is also striking that, after the problem statement, they give a description of something that is already widely available in proprietary software. Lastly, as an applied statistician rather than a software engineer, I find the description of the problem excessively complicated and full of grandiose language when what they are doing is, at least conceptually, simple.

Problem definition

I should like the authors to say something similar to the following in clear plain English.
- They are developing a system to design and test software systems.
- The software systems are largely Distributed Object Computing systems.
- The design phase uses, conceptually:
  o a number of sockets into which software elements can be fitted (a modular approach);
  o the system has to meet a number of performance requirements, largely QoS;
  o the software elements are standard components, but their performance can be adjusted through controllable parameters;
  o there are environmental and interaction effects which are difficult to predict.
- The objective is to automate the design by developing a system to select the software elements and to set the appropriate parameter values.
- The automation is necessary to allow the designers to cope with the very large number of combinations within a single design (there is a combinatorial explosion).

Design and analysis

This is the SKOLL part. The key points are:
- the system produces a design which may be a restricted version of the final product to allow effective testing;
- the system produces the testing environment which can exercise the product.

Experimentation

This part is largely a rather poor and heavy-handed exposition of experimental design. It should be dispensed with and referenced out. Further, it is a re-invention of the wheel, and the references suggest an almost perverse ignorance of the application of experimental design and statistical techniques in other industries. They seem to borrow from, without referencing:
- six-sigma approaches
- "the seven old tools"
- Exploratory Data Analysis

There is much ready-made software to implement these parts of their process, software that has been widely used and thoroughly proven: MINITAB, for example, does everything they describe and provides rather more help to the user. There is also a very good system for statistical analysis available through the National Institute of Standards and Technology (http://www.nist.gov/) and Sematech (http://www.sematech.org/). That system is the Engineering Statistics Handbook (http://www.itl.nist.gov/div898/handbook/index.htm).

What they need to keep in the article is the value of screening experiments and the choice of the resolution of a design. The details of fractional designs and D-optimal designs can be dropped. If they do this, they can explain more clearly the value and limitations of their approach. In particular, screening designs are all two-level designs and so miss the possibility of non-linearities in responses; they are also poor on interactions (their discussion of aliasing and resolution). They do not need to say much about the computational simplification that follows from using orthogonal designs; we'll assume that there will be software available to do this drudgery (MINITAB, for example).

It is also striking that they have not mentioned robust design, something which is widely used in other industrial sectors. Robust design would seem to be highly relevant in making software systems proof against environmental variation.

Lastly, while discussing the use of experimental design: it really isn't very useful when the number of factors in the experiment is large. It is well known (look at any book by Doug Montgomery) that while DOE experiments may produce feasible designs, even robust designs, they cannot distinguish well between the large number of alternatives that meet the product requirements (an identifiability problem). They should look at the lecture by Richard Parry-Jones, Engineering for Corporate Success in the New Millennium (http://www.raeng.org.uk/news/publications/).
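To make the reviewer's point about aliasing and resolution concrete, the following is a minimal sketch (a standard textbook construction, not taken from the paper): it builds the half fraction of a 2^3 design with generator C = AB and checks that the C column coincides with the AB interaction column, so the two effects cannot be estimated separately.

import numpy as np
from itertools import product

# Full 2^3 factorial in coded units (-1, +1).
full = np.array(list(product([-1, 1], repeat=3)))
A, B, C = full[:, 0], full[:, 1], full[:, 2]

# Half fraction defined by the generator C = A*B (defining relation I = ABC).
half = full[C == A * B]
a, b, c = half[:, 0], half[:, 1], half[:, 2]

# In the half fraction the column for C is identical to the A*B column,
# so the main effect of C is aliased with the AB interaction; this is
# what makes the fraction resolution III.
print(np.array_equal(c, a * b))  # True

Higher-resolution fractions push such aliasing onto higher-order interactions, which is exactly the trade-off a screening design makes to stay cheap.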
Recommendation

I think there is a really interesting and useful piece of work struggling to escape from this article, and the authors should be encouraged to resubmit a heavily revised version which addresses the issues I have raised. They need to:
- cut out all the statistical exposition;
- clarify the nature of the problem addressed, as outlined above;
- relate their approach to other uses of experimental methods in engineering design.

References

I have mentioned exploratory analysis, the seven old tools, and Exploratory Data Analysis. They can find an excellent exposition of all of these things in the book below. Since this book covers all the statistics mentioned in the article, there does not seem to be much of a case for leaving them in.

Quality: Systems, Concepts, Strategies and Tools, W.J. Kolarik, McGraw-Hill Education, March 1995, ISBN: 0070352178

**************************