Research Computing Task Force 2010-2011

Krzysztof Szalewicz, 2010-2011 Research Computing Task Force Chair

  1. CHARGE FOR 2010-2011
  2. SCOPE
  3. EXECUTIVE SUMMARY
  4. STAFFING
  5. PHYSICAL INFRASTRUCTURE
  6. TRAINING AND SUPPORT
  7. CLUSTERS
  8. SOFTWARE
  9. EXPERIMENTAL SERVERS/CLOUD COMPUTING
  10. TERAGRID
2010-2011 Research Computing Task Force Membership List
  1. CHARGE

    Establish priorities and determine cost estimates for the tasks proposed in the Spring 2010 RCTF report (excerpts in blue italic).

  2. SCOPE

    The scope of interest for this Committee is research computing. While that scope is certainly not limited to high-performance computing, not every activity that could be classified as research computing falls within it. One example of an activity outside our scope is the use of spreadsheet programs for data processing.

  3. EXECUTIVE SUMMARY

    Computing has become one of the major research tools in all fields of science. Effective use of such tools requires a high level of hardware and software expertise and involves significant costs. At present, research computing at the University of Delaware is conducted by individual research groups, with a relatively low level of campus-wide support. The purpose of this proposal is to show that significantly improved effectiveness and significant cost savings can be achieved through modest central support of research computing. The proposed actions will lower the barriers to the use of computational tools for faculty who have not used them so far and will encourage such faculty to apply advanced computational methods in their research. For current users, the proposed actions will provide ways to work more effectively, to cooperate with each other, and to enhance the overall capabilities of the University's computing infrastructure through contributions from their grants. In turn, the existence of significant computational infrastructure will help investigators get their proposals funded.

    The Committee believes that the best way to reach these goals is through the following main actions (several other ways to improve the effectiveness of research computing are discussed in the subsequent sections):

    • Immediately hire two staff members, eventually increasing the number to five. The first two hires should be technical people able to manage computer clusters and to help researchers use these clusters effectively.

    • Purchase a "seed" cluster. We recommend spending about $1M for this cluster, but a quarter of this amount would get the community cluster started.

    • Increase support for software purchases by about $100K.

  4. STAFFING

    The task force feels that this is the most pressing item. The University is seriously understaffed in research computing support. The staff currently providing dedicated support are severely overloaded, and yet are constantly asked to take on more responsibilities. The University's research potential is compromised and may even be put in jeopardy by this staffing crisis. Further study is needed to identify our specific needs and how to meet them.

    • What type of people do we need?

      1. People in technical staff positions, with expertise significantly above the departmental CITA level. They should be able to solve hardware and software problems on clusters, have knowledge of programming languages and parallel programming techniques, and be able to train users. Familiarity with large-scale scientific codes and with the issues involved in running such codes on thousands of processors is expected. Some of these people will be ready to answer research-computing questions from students, postdocs, and faculty. One of their tasks will be to support the campus-wide cluster discussed below and to help departmental CITAs with problems concerning clusters owned by individual groups. We do not envision these people being significantly involved in any specific research activity, for example writing computer codes for individual research groups, but they should be able to provide substantial assistance to group members writing such codes.

        Not all people in this category should have the same duties. One person should be focused only on central hardware and software maintenance, system administration, job scheduling, and networking issues for the campus-wide cluster. His or her formal responsibilities to support departmental IT staff with their departmental/college clusters should be fairly limited and this person should not be tasked with direct user consulting for programming or for software applications.

        Two or more technical staff should be charged with supporting departmental IT staff, faculty, students, post-docs, and other researchers in all aspects of research computing. These tasks should also include training and consulting on programming and scripting languages, parallel programming techniques, application support (e.g., mathematical and statistical software), collaboration tools, secure data management and distribution, etc.

        These positions should be high-level technical staff support lines, filled by experienced staff who do not have their own research agenda, but whose career track is primarily in support of others (salary range $50K-$85K/12-months + 33% benefits; a rough cost illustration based on these figures appears at the end of this section). The hires should be on a career ladder more like the Research Scientist one than the CITA type. Current IT staff partly meet the requirements outlined above and can be assigned to these duties provided that new staff hires pick up their current work. Depending on the extent of the consulting demands, they could be supplemented or complemented by CIS graduate students working 20 hrs/week ($16K/9-months plus summer support).

      2. As a second priority, staff with expertise in statistical methods (separately for social science and biomedical science) and visualization are needed. Such people are likely to support instructional activities as much as research activities, and therefore their funding should be distributed appropriately. Another example of a field-specific staff member could be an expert in computational chemistry software. In statistical methods, these hires should significantly enrich the existing support provided by IT staff (PhD- and MS-level consultants) and by graduate students. Salary requirements are similar to those in (1).

      3. In addition to the staff positions discussed above, it is proposed that UD hire a number of Research Professors or Research Scientists with expertise in research computing. Candidates for these positions could be postdoctoral associates who want to move into more permanent positions. Such people would continue to be supported at about 50-70% from grants related to their research, with the remaining support from UD. In return, they should spend an appropriate fraction of their time training and consulting for the UD community in various aspects of research computing. The advantage of this option would be a high level of expertise, as these people will themselves be involved in computing-intensive research. This may be the only practical way of getting support staff truly familiar with large-scale research computing. The minimum salary would be an appropriate percentage of $78K/12-months plus 33% benefits.

    • How many research computing support staff should be hired? There is a wide range of staffing levels at our peer institutions. Some of them (U. Pittsburgh, RPI) run nationally ranked supercomputer centers and employ many dozens of staff. Similar-scale or larger centers (not counting university-located supercomputer centers supported directly by federal agencies) are located at several other universities (e.g., Texas Advanced Computing Center at U. Texas, Cornell U. Center for Advanced Computing, High-Performance Computing and Communication Center at USC, Center for Computational Research at U. Buffalo). Most of our peer institutions have a lower level of research computing support. Our survey shows the following staff numbers: Princeton: 9, U. Virginia: 12, Brown: 8, Boston College: 27, Penn State: 7. We should aim at matching these sizes, initially with about 5 people.

      We also discussed how many clusters one person can support, but concluded that the answer can be determined only case by case. Whereas the time required depends only weakly on cluster size, it depends critically on the number of users if support is to include helping users solve their problems. The approximate number for current hardware is about 10 small clusters, with fewer than 10 users each, per CITA.

    • Time-frame: hire two people starting Fall 2011 to get us going. The initial hires should have the expertise discussed in point (1) above. One of the two hires should have system administration responsibilities. The other should be a technical consultant, with particular strengths in programming languages, parallel programming techniques, and cluster environments. These choices will best complement the staff UD already has committed to research computing support.
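
    As a rough illustration of the budget implications of these recommendations, the sketch below (in Python) estimates the fully loaded annual cost of the initial two hires and of the eventual five-person group. The salary and benefit figures come from the ranges quoted above; the specific salaries chosen are assumptions for illustration only.

        # Back-of-envelope staffing cost estimate. Salaries are assumptions
        # within the $50K-$85K range quoted above; benefits are 33% of salary.
        BENEFITS_RATE = 0.33

        def loaded_cost(salary):
            """Annual cost of one position: salary plus 33% benefits."""
            return salary * (1 + BENEFITS_RATE)

        # Two initial technical hires, assumed near the middle of the range.
        initial_hires = [70000, 70000]
        # Eventual five-person group (three additional staff at assumed levels).
        eventual_hires = initial_hires + [65000, 65000, 75000]

        print("Initial (2 staff):  $%.0f/year" % sum(loaded_cost(s) for s in initial_hires))
        print("Eventual (5 staff): $%.0f/year" % sum(loaded_cost(s) for s in eventual_hires))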

  5. PHYSICAL INFRASTRUCTURE
    Chapel St. Computing Center is currently near capacity (HVAC, power, and to some extent, space). The accelerating need to grow the University's computing capacity makes a sustainable plan for more physical infrastructure the University's second most pressing need.

    This problem has mostly been resolved in the meantime. The recent power upgrades of the facility, plus some approved upgrades in the near future, should provide capacity for hosting new clusters over the next four to five years. Dan Grim estimated that about 50% of floor space and power/AC capacity will be used by the Fall of 2011. Further capacity can be gained in several ways: replacing aging servers, moving equipment to UD's remote disaster recovery site, more efficient use of space via rack-mounted systems, and further improvements to the physical plant infrastructure. New routers will be needed to provide increased communication bandwidth. Several other aspects also have to be considered, for example the need to back up the new clusters.

    One possibility to explore is creative use of the former Chrysler site. We did not have time for substantive discussions on proposals for this site, but a parallel supercomputing facility would be a good use of the location's acreage and its access to power and high-bandwidth network connectivity.

    There is nothing the Committee can do about this in the near future, but we should watch the development of UD's plans for the site.

  6. TRAINING AND SUPPORT

    There was a strong sentiment that better training and support are needed on campus. In particular, all clusters run some version of UNIX, and this operating system is often unfamiliar to incoming graduate students. Task Force members mentioned specific training needs in basic UNIX/Linux skills, shell scripting, Perl, Python, parallel programming techniques, specialized statistical topics, and database programming. A broader set of UD faculty should be surveyed to determine training needs, priorities, and additional staffing requirements.
    ...

    New courses and training sessions have already been announced; in particular, entry-level workshops on Unix started in January 2011 and will be given at least twice a year. In addition to basic Unix, the series covered basic bash-shell scripting and introductions to C and Fortran environments on Linux systems.

    Intermediate-level training should be the next step. Such courses should include shell scripting and a practical introduction to programming languages (the ability to "read" C and/or Fortran programs). Part of this need can be satisfied by (preferably free) online courses. IT is currently compiling lists of elementary and advanced online resources for learning or becoming familiar with C, C++, and Fortran, scripting languages such as Perl and Python (particularly for scientific applications), utilities such as make, programming extensions such as MPI and OpenMP, parallel numerical libraries, debugging techniques, and performance tuning. These videos, PowerPoint slides, and training materials will become part of a larger collection of training aids on the IT Research Computing site (www.it.udel.edu/research-computing).
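
    As an illustration of the entry-level parallel-programming material such resources would cover, a minimal MPI "hello world" is sketched below. It is written in Python using the mpi4py package purely as an example; the actual training materials may instead use C or Fortran, as noted above.

        # Minimal MPI example: every process reports its rank.
        # Run with, e.g.:  mpiexec -n 4 python hello_mpi.py
        from mpi4py import MPI

        comm = MPI.COMM_WORLD      # communicator containing all started processes
        rank = comm.Get_rank()     # this process's ID, 0 .. size-1
        size = comm.Get_size()     # total number of processes

        print("Hello from rank %d of %d" % (rank, size))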

    Some academic courses in Computer & Information Sciences are also relevant, but would require additional teaching staff to accommodate the increased demand. More work needs to be done to define this need and recommend further action.

    Some new computer courses are already available, for example, CISC 879 "Advanced Parallel Programming Models for Scientific and Engineering Applications" taught by Michela Taufer. The existing UD classroom capture services can be used to record such classes in high resolution. When these courses have student demand that cannot be filled without overloading the faculty member, the courses could be replayed at other times with a properly trained teaching assistant available to answer student questions.

    In terms of support, one question is what can be done right now to answer questions from the community, in particular from graduate students, concerning problems related to research computing. Perhaps an "inventory of cyber-knowledge" should be created and posted online so that people will know whom to approach with different kinds of questions. This inventory could include not only IT staff but also faculty and other researchers on campus, which would promote researchers-helping-researchers interactions.

  7. CLUSTERS

    As we have already stated, managing and purchasing clusters are key concerns for many faculty on campus. Early on, we resolved to inventory the clusters on campus.
    ...
    While there will always be a place for specialist clusters, this suggests that many users might see improved functionality (i.e., more cores and memory) and fewer support headaches by combining purchases into a smaller number of larger units.

    A Community Cluster purchase program [] allows researchers to contribute hardware to a centrally managed cluster in return for clearly defined access to their hardware, and to the machine as a whole (which is, of course, much larger than any one group could afford).
    ...

    One may ask if there are enough people on campus involved in research computing to justify a community cluster. The Committee believes there are and that the time is right to go forward with such a cluster. The cluster inventory done last spring showed that UD researchers or departments owned over 50 different cluster systems. These are maintained by about 10 people within Arts & Sciences, DBI, Engineering, and Earth, Ocean and Environment. IT polled departmental and college systems administrators about how many faculty, staff, and students are actually using their systems, and the number is in the hundreds. In fact, there are about 100 users in some departments (such as the Department of Physics and Astronomy, which runs 12 clusters).

    The following model for a Community Cluster emerged from discussion:

    • General-use community cluster

      This cluster is intended for all members of the UD community. The initial "seed" cluster should be purchased by UD, and everybody interested will get access to it. Such a cluster should be substantially larger than any of the existing clusters that were purchased from individual grants or multiple-PI instrumentation grants. The important novel element is that individual faculty will be able to add nodes to the cluster. In return, their research groups will get additional allocations of cluster resources, roughly proportional to the relative size of their investment (a simple illustration of this buy-in model is sketched at the end of this section). We expect this option to be very attractive for faculty, as it will allow them to acquire an appropriate computational platform without the headaches of setting up and maintaining an individual cluster.

      A community cluster should initially be built of relatively homogeneous nodes from commodity components, as such systems not only offer the best price/performance ratio but are also the most straightforward to expand. Other key advantages of this architecture, compared to either heterogeneous collections of systems or vendor-specific models, include lower system administration and technical consulting costs as well as lower hardware maintenance costs. Furthermore, a homogeneous system leads to a single pool of researchers sharing common knowledge. The group can work more efficiently and effectively; it would need to learn how to optimize programs for only a single architecture, and the system would be easy to navigate.

      Community clusters are typically based on an architecture consisting of a large number of shared-memory multi-core commodity nodes (about 64 cores at present), with local storage on each node (several TB) and significant amounts of RAM (at least 2 GB per core). The nodes are connected by a medium-bandwidth backbone network of about 10 Gb/s. An external file system (disk farm) may be part of the cluster, allowing processing of very large data sets (dozens of TB). Such clusters are becoming a standard for many smaller and larger research institutions (e.g., UVa, Dartmouth, UW-Milwaukee, Purdue, Indiana U.). Fine tuning of the configuration, in terms of balancing the components listed above, can be done by surveying research-computing faculty. There can be several types of nodes in the cluster (differing mainly in the amount of memory and local disk space).

      When a faculty member adds nodes to the cluster, the configuration of those nodes will be mainly up to that person, within general cluster restrictions. Such faculty can get priority access to their nodes, but we do not recommend creating separate cluster partitions for individual faculty.

      One more advantage of a commodity cluster is that it can run "forever": after a few years, obsolete racks can simply be replaced with current-generation hardware, and from time to time the backbone has to be upgraded.

      We believe that it would be appropriate for UD to purchase a "seed" cluster. Since the purchase prices for the largest clusters now on campus were around $600-800K, an investment of about $1M appears to be a reasonable amount to take a step forward. With such financing, we could purchase a cluster with nearly 10K cores, which would place it around position 300 on the worldwide TOP500 list of supercomputer sites. For comparison, the number one system on this list has 186K cores, and there are about 20 US university computers within the top 300. One could also consider asking faculty who have funds available to contribute to the initial purchase, which may significantly increase the capacity of the system.

      Independently of the Community Cluster, UD should also improve utilization of existing clusters and provide some support for departmental clusters.

    • Agglomeration of existing clusters

      As an additional initiative that may significantly boost research computing on campus, we propose the creation of a "super-cluster" from clusters already on campus, on a voluntary basis. Combining some existing clusters into larger units will have a significant synergistic effect; in particular, it will lead to better load distribution and the ability to perform larger-scale calculations than are possible with individual clusters.

      The task of setting up the super-cluster will require a dedicated staff member, so it can be done only after the proposed hires are on campus. The integration of existing clusters can take many forms. The simplest is to set up a proper queueing/accounting system on all participating clusters (several software packages are available for sharing heterogeneous distributed systems; one example is Condor, developed at U. Wisconsin) and to allow all participating group members access to all clusters. Simple rules can be formulated to allocate resources in proportion to the capacity contributed by each group (the sketch at the end of this section illustrates such a rule). A more integrated solution would be to connect the existing clusters by a fast backbone and create a single head node allowing submission to any compute nodes and simultaneous use of nodes on more than one cluster.

    • Funding for departmental clusters

      Several departments have departmental clusters, and many of them need to be replaced. The community-cluster initiative should not interfere with this process. Some central funds from the Dean's and Provost's Offices, as well as from Unidel, should be available to supplement departmental funds for such purchases. However, departments should be encouraged to contribute to the community cluster rather than purchase an individual one. The former option would result in a proportional allocation of resources on the community cluster for members of a given department ("virtual machines" can also be created for departments on the community cluster if needed).
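
    To make the buy-in model concrete, the sketch below (in Python) estimates the capacity obtainable for the roughly $1M seed investment discussed above and shows one simple way to compute resource shares proportional to each group's contribution, as proposed for both the community cluster and the super-cluster. The per-node price and the individual contributions are illustrative assumptions, not vendor quotes or commitments; the node configuration (64 cores and 2 GB of RAM per core) follows the description above.

        # Illustrative sizing and fair-share calculation for the community cluster.
        # The per-node price is an assumption, not a quote.
        NODE_PRICE_USD = 6500
        CORES_PER_NODE = 64
        RAM_PER_CORE_GB = 2

        def cluster_size(budget_usd):
            """Estimate nodes, cores, and total RAM obtainable for a given budget."""
            nodes = budget_usd // NODE_PRICE_USD
            cores = nodes * CORES_PER_NODE
            return nodes, cores, cores * RAM_PER_CORE_GB

        nodes, cores, ram_gb = cluster_size(1000000)
        print("Seed cluster: %d nodes, %d cores, %d GB RAM" % (nodes, cores, ram_gb))

        def shares(contributions):
            """Resource share of each group, proportional to its contribution;
            the UD seed purchase is treated as the general-access pool."""
            total = float(sum(contributions.values()))
            return {group: amount / total for group, amount in contributions.items()}

        # Hypothetical contributions: the UD seed purchase plus two buy-in groups.
        print(shares({"UD seed": 1000000, "Group A": 150000, "Group B": 50000}))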

  8. SOFTWARE

    To a greater or lesser extent, we are all users of commercial software products, from statistical packages to engineering and scientific simulation packages and general development environments. Efforts should be made to coordinate software license purchases so that users can leverage bulk purchases for cost efficiency, and existing licenses need to be well advertised. Funding resources need to be identified more effectively.
    ...

    UD should create a fund for purchasing campus-wide licenses for research software used by a large number of faculty (20 or more). The cost per group should be lower than the sum of the costs of individual licenses. Options include mixed contributions from UD and from individual grants.

    Equally important as the availability of funds is coordination of software acquisitions, even in cases of individual purchases or purchases by groups of faculty. IT has already been working on increasing campus-wide awareness of existing software and on developing procedures to reduce existing inefficiencies. A committee should be created to periodically evaluate software requests. The committee should make efforts to standardize software in cases where many similar packages are available and should be aware of no-cost alternatives. We realize that the committee will often have to make difficult decisions (for example, deciding between the SAS, SPSS, and STATA statistical packages). Maintaining lists of significant holdings of research software on central or departmental servers, as well as on IT and departmental sites, is a design goal of the IT Research Computing web site and the new UDeploy site. This is also the place to register requests for new packages. The list should also include research-computing software above some value (say, $1K) purchased by individual faculty. If possible, such software should be tracked by UD's purchasing office.

    The costs of research software vary from a few thousand dollars to as much as $100K for programs such as the SPSS statistical analysis package. Usually, after a package is purchased, it can be used without additional investment for a few years, until a new version becomes available. Examples of popular research software include packages for statistical analysis (SAS, SPSS), computational chemistry (Gaussian, Molpro, ADF, Charmm), and compilers (Intel, Portland). An estimated $200K/year is needed to cover basic needs, but more funds will certainly be welcomed. In fact, the current central software expenditures are already above $100K/year (ESRI GIS: $25K, AutoCAD: $20K, Mathematica: $16K, Adobe: $15K, IMSL: $8K, NAG: $8K, SAS: $9K, SAS JMP: $8K, although some of this software can only partially be classified as research-computing software). Some of these costs are shared by IT and individual researchers.
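
    For reference, the central expenditures itemized above sum to just over $100K/year, consistent with the statement that current spending already exceeds that level; the trivial check below (in Python) uses only the figures listed in this paragraph.

        # Central software expenditures listed above, in thousands of dollars per year.
        expenditures = {
            "ESRI GIS": 25, "AutoCAD": 20, "Mathematica": 16, "Adobe": 15,
            "IMSL": 8, "NAG": 8, "SAS": 9, "SAS JMP": 8,
        }
        print("Total listed central expenditures: $%dK/year" % sum(expenditures.values()))
        # -> $109K/year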

  9. EXPERIMENTAL SERVERS/CLOUD COMPUTING

    With the increased use of cloud computing and virtualization, some faculty felt that it is desirable to have a campus computing service offering on-demand virtual Linux and Windows machines. This is particularly important for the College of Business & Economics because cloud computing is a common solution for many service applications.

    Virtual machines can be created within the community cluster as needed, with capacities proportional to the resource allocation of a given group. Some fraction of the community cluster's capacity can also be offered for cloud computing, and the income can be used to add capacity. Dedicated cloud hardware purchased by UD would probably be underused, since it is easy to purchase resources from outside providers, so such hardware is not proposed.

  10. TERAGRID

    TeraGrid offers NSF awardees vast resources for parallel computing. But projects often need to undergo pilot testing on local systems and show promise before being allowed to use TeraGrid resources. A Community Cluster approach could make such a machine available to a wider range of faculty who could not otherwise afford a 500+ processor machine.

    Training in the use of TeraGrid resources is needed and could be a combination of internal efforts coupled with information (perhaps web-based) from TeraGrid staff. We need a "Campus Champion" (TeraGrid's terminology) to promote TeraGrid and coordinate training.

    The community cluster should solve the testing problem, in particular if central UD funds are used to purchase the initial "seed" cluster, as proposed.

    IT is moving on the "Campus Champion" issue, and Dean Nairn has been appointed as UD's first TeraGrid representative. This representative should be able to provide initial assistance with technical issues involved in starting calculations on TeraGrid.

2010-2011 Research Computing Task Force Membership List

Name              Department                         Phone   E-Mail
Mark Barteau      Research                           4007    barteau@udel.edu
Dominic Di Toro   Civil & Environmental Engineering  4092    dditoro@udel.edu
Doug Doren        Arts & Sciences                    1070    doren@udel.edu
Jeff Frey         IT-NSS & Col of Engrg              6034    frey@udel.edu
Jeffrey Heinz     Linguistics & Cognitive Science    2924    heinz@udel.edu
Xiaoming Li       Electrical and Computer Engrg      0334    xli@udel.edu
Peter Monk        Mathematical Sciences              7180    monk@math.udel.edu
Ratna Nandakumar  School of Education                1635    nandakum@udel.edu
Sandeep Patel     Chemistry & Biochemistry           6024    sapatel@udel.edu
Petr Plechac      Mathematical Sciences              0637    plechac@udel.edu
Dick Sacher       IT Client Support & Services       1466    dsacher@udel.edu
Michael Shay      Physics & Astronomy                2677    shay@udel.edu
Stephen Siegel    Computer & Info Sciences           0083    siegel@cis.udel.edu
Karl Steiner      Research                           6703    ksteiner@udel.edu
Martin Swany      Computer & Info Sciences           2324    swany@udel.edu
Michela Taufer    Computer & Info Sciences           0071    taufer@udel.edu
Dion Vlachos      Chemical Engineering               2830    vlachos@udel.edu
Harry Wang        B&E Accounting & MIS               4678    hjwang@lerner.udel.edu
Cathy Wu          Computer & Info Sciences           8869    wuc@udel.edu
Xiao-Hai Yan      Col Earth Ocean & Environment      3694    xiaohai@udel.edu