Minutes from the Feb. 3-5, 1999 MPI/RT Forum meeting that was held in San Diego, CA.

Attendees:

    Arkady Kanevsky (chair)   MITRE      arkady@mitre.org
    Manoj Apte                MSU        manoj@erc.msstate.edu
    Randall Judd              SSC-SD     judd@spawar.navy.mil
    Dennis Cottel (host)      SSC-SD     dennis@spawar.navy.mil
    Leonard Monk              CSPI       lmonk@cspi.com
    Nathan Doss               LM-GES     nathan.e.doss@lmco.com
    Shane Hebert              MSU        shane@erc.msstate.edu
    Darwin Ammala             MSTI       dammala@mpi-softtech.com
    Clay Taylor, Jr           MSTI       cdtaylor@mpi-softtech.com
    Michael Grieco            ASEC/NSA   mjgrieco@jswg.org
    James Lebak               MIT/LL     jlebak@ll.mit.edu
    Steve Paavola             SKY        paavola@sky.com
    Rick Massary              NAWCAD     massaryrj@navair.navy.mil
    Rob Ginn                  NAWCAD     rob@sun701.nawcad.navy.mil

SUMMARY

There were no official or straw votes. The following action items were discussed and assigned for the proposal writing:

    A. Arkady - RT-Mode changes (topic 6)
    B. Shane - K-Slack channels (topic 8)
    C. Steve - Data descriptors (topic 11)
    D. Dennis - Sleep, wake up, wait on (topic 15)
    E. Shane - Bufiter Dataspec homogeneity (topic 22)
    F. Arkady & Leonard - Unreliable channels (topic 23)
    G. Shane - Half channels (topic 24)
    H. Steve - Relocatable buffers (topic 43)
    I. Steve - Variable length buffers (topic 44)
    J. Leonard & Arkady - Unique ID (topic 46)
    K. Arkady - Update channel state transition diagram and write a description for it for MPI/RT-1.0.

1. The meeting came to order at 8:30am on Wednesday, February 3.

2. The updated list of action items for the MPI/RT was discussed. The following items were added to the list over the three days:
    a. Light-weight handlers
    b. Half channels
    c. Default handlers
    d. Group parameter scope for the commit
    e. Mechanisms for passing in externally built schedules to MPI/RT
    f. Logical grouping of event kinds (arithmetic on events)
    g. PGCOs
    h. QoS for handlers (add deadlines)
    i. Mutable buffers (late binding of address)
    j. Variable length buffers
    k. Periodic triggers
    l. Unique Id for running MPI/RT version
    m. QoS for QoS (granularity for raising errors)

3. The topic of real-time mode changes was discussed (topic 6). First, a short summary of the history of the topic was presented. The objects MPIRT_RESOURCE and MPIRT_TRANSIT were reintroduced. A general discussion on resource allocation, resource reservation, admission tests, and transitions between real-time modes took place.

It was decided that for now we should restrict this proposal to MPIRT_GROUP_WORLD and only after that is completed consider other groups. This decision was based on the following question: How does an implementation deal with a transition with QoS if it does not know when objects are in use by other groups? Agreement was reached that there is at most one active real-time mode per process. This ensures that all objects are available for transition.

Trade-offs of creating resource objects without knowing future transitions vs. knowing the entire graph of all transitions were considered and discussed. It was pointed out that the same object may be created differently by two different resource objects, which may not allow a transition between them that preserves the state of the object, or may not allow a transition with the requested transition QoS. Some discussion concentrated on "incremental" admission.

The QoS specification for real-time mode changes was discussed. It was agreed that initially we should consider only best effort. The deadlines and events that trigger a transition should be discussed after that. Synchronization of the transition from one real-time mode to another on different processes (processors, nodes) was considered as part of this QoS discussion.

The relationship of MPI/RT to the OS, hardware, and other parts of the system was also partially discussed. We agreed that there is a need to support some objects remaining operational during a transition (background channels). The effect of this on support for layered libraries was also considered.
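
The constraints agreed in this discussion can be illustrated with a small model. This is only a sketch and not the MPI/RT API: the names Mode, ModeManager, and transit are hypothetical, no deadlines are modelled (best effort only), and the rule that every non-background channel is torn down and set up again on a transition is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set, Tuple


@dataclass(frozen=True)
class Mode:
    """One real-time mode: a name plus the channels it needs (hypothetical)."""
    name: str
    channels: FrozenSet[str]


@dataclass
class ModeManager:
    """Per-process bookkeeping: at most one real-time mode is active at a time."""
    # Allowed (from, to) mode-name pairs; this may be a partial transition graph.
    transitions: Set[Tuple[str, str]]
    # Channels that must stay operational while a transition is in progress.
    background: FrozenSet[str] = frozenset()
    active: Optional[Mode] = None

    def transit(self, target: Mode):
        """Best-effort switch to `target`.

        Returns (kept, closed, opened): channels left running, channels torn
        down, and channels newly set up.
        """
        if self.active is None:
            # First activation: nothing to tear down.
            self.active = target
            return frozenset(), frozenset(), target.channels
        if (self.active.name, target.name) not in self.transitions:
            raise ValueError(
                f"no known transition {self.active.name} -> {target.name}")
        old = self.active.channels
        # Background channels present in both modes keep running throughout.
        kept = self.background & old & target.channels
        closed = old - kept
        opened = target.channels - kept
        self.active = target  # the previous mode is no longer active
        return kept, closed, opened
```

In this model a process that moves from a mode using {scan_in, telemetry} to one using {track_in, telemetry}, with telemetry declared as a background channel, keeps telemetry running while scan_in is closed and track_in is opened. A request for a transition that is not in the supplied graph is rejected.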
Arkady agreed to write a proposal that allows building the two discussed objects in total isolation, with a partial transition graph, or with the full transition graph. It is up to the user to decide which information to provide for the creation of these objects. This proposal will also support operating objects during a transition.

4. A concern was raised about the lack of examples in the document. A tutorial will be presented at MPIDC'99 and will be available on the MPI/RT web site afterwards (mid March). The example chapter will be updated and included in the final document of MPI/RT-1.0.

5. Half channels (topic 24) were discussed. The following issues were brought forward:
    a. periodic half channels,
    b. data streams,
    c. interrupts to pull data to half channel buffer,
    d. aperiodic interrupts,
    e. event kinds for half channels (only receptors), with triggers as I/O interrupts,
    f. interface consistency with existing channel end points specification: buffers, bufiters, QoS and so on,
    g. admission test and QoS specification for half channels,
    h. devices (MPI/RT does not control devices, drivers are outside of MPI/RT),
    i. data can be "dumped" into a buffer without interrupting CPU,
    j. device registration,
    k. hooking up half channels with devices,
    l. scheduled pulling,
    m. random vs. deterministic half channels (dataspecs, data size)

A device driver model with a defined "operation-func" that will be invoked on an interrupt to pull the half channel data into a buffer was accepted. Completion notification of a half channel message transfer uses the existing channel end point functionality of events on a channel and/or bufiter.

The relationship between MPI/RT and other entities on a platform was again considered. A consensus is forming that MPI/RT internally specifies constraints on the use of resources (bus, memory, and so on) for other entities outside of MPI/RT. The issue of multiple devices "dumping" data into the same half channel was considered.
An issue of extra states for a half channel was brought forward. A tentative template of the half channel create function was written:

    MPIRT_HALF_CHANNEL_CREATE (
        activate_function,
        deactivate_function,
        operation_function,
        ...
        inbufiter,
        outbufiter,
        hc_qos,
        ... other standard channel end point parameters ...
        *hc);

The direction of a half channel was briefly considered, as well as operations on a half channel. Shane was "volunteered" to summarize the discussion in the form of a proposal.

6. K-slack (topic 8) was discussed. The main issue is to allow an implementation to transfer more than one buffer at a time to improve bandwidth. For example, an implementation may be able to save on synchronization ("ack", "nack"). The current MPI/RT-1.0 specification allows at most one buffer to be in a channel at a time. A state transition diagram depicts this requirement. Some protocols work more efficiently if multiple buffers can be transferred "together", for example a "sliding window" protocol. It was agreed that "k" means that at most k buffers can be in the possession of a channel at any one time. But there is no requirement or guarantee that an implementation will take advantage of that feature.

There was a discussion of where we should specify "k": early vs. late binding. It was agreed that the best place to specify it is at early binding, either as part of the QoS or as a channel parameter. The forum was leaning toward the latter. Channel end points are not required to specify the same slack number. At commit time the implementation will negotiate the slack number. It should probably return the negotiated number back to the users.

The biggest discussion centered on new and modified operations on a channel to support k-slack. The following four operations are under consideration:
    a. Wait_All - returns when all "outstanding" transfers complete (a "window" transfer is completed),
    b. Wait - returns as soon as the "first" outstanding transfer completes (notice that several transfers may complete at the same time). Other interpretations may also be possible and should be discussed further.
    c. Get_Slack - returns the number of outstanding transfers in the channel,
    d. Wait_on_specific_buffer - returns when the specified buffer is in the out-bufiter of the channel. This operation was discussed in great detail on the last day of the meeting. It may require "polling" on every buffer of the channel and on every channel that may share that buffer (or bufiter). Hence, this operation may be quite expensive and may adversely affect the performance of other objects. We may consider a different operation that will allow the user to find out when the n-th transfer has completed over the channel.

Once the proposal is written we will return to the issue of the purpose of these operations and what users are trying to achieve with them. The relationship between individual Waits and Starts needs to be considered further, as well as the changes needed for the channel state transition diagram. Shane again was volunteered to write a proposal as a representative of MSU.

7. Unreliable channels (topic 23) were discussed. It was clear that the restriction that MPI/RT "guarantees that the underlying transmission of messages is reliable," which MPI/RT inherited from MPI, needs to be reconsidered. [It was pointed out that this cornerstone assumption seems to be missing from the MPI/RT-1.0 document.] After a long discussion it became clear that the forum does not have a clear picture of how this unwritten requirement of "guaranteed delivery" is interpreted. For example, can an implementation "mask" a member of a group that is not "reliable" so that, as far as all other members of the group are concerned, it is "active"? Allowing that together with "unreliable" channels will allow other members of a group to "communicate with it unreliably".
The dividing line between this topic and fault-tolerance, fault-handling, dynamic process management, instrumentation, and other topics is blurred. Consensus seems to be forming that unreliability should be part of the channel QoS specification. If a channel is unreliable, the requirement for the implementation to raise an error when the QoS is missed is removed. The implementation can raise that error, but users should not rely on that for unreliable channels. [This raised a question about the QoS error for the best effort QoS in the current specification, and also a timing issue of when a buffer is moved from the in-bufiter to a channel and from the channel to the out-bufiter for best effort QoS.] [It was pointed out that the definition of reliability needs to be added to the glossary.] The issue of the "global" state view was brought up again. Arkady and Leonard agreed to flesh out more of these issues and summarize the unreliable channel discussion in a proposal.

8. The data descriptors (topic 11) were discussed. It was pointed out that strided DMAs may not improve performance vs. local rearrangement of data in memory and then transferring it. This is true for DRAM but not for SRAM. It was decided that we will not now consider strides as part of dataspecs but instead let users do packing and unpacking themselves. The issue of different dataspecs in one buffer was considered, as well as contiguous memory buffers vs. scatter-gather ones. It was decided not to address these issues for now but instead to consider two other issues: mutable buffers (late binding of buffer address) and variable length buffers. Some of the postponed issues will be reconsidered when we address topic 1, Data Reorganization.

The definition of structures was considered as part of the above discussion. The buffer descriptor consists of three parts: dataspec, size, and address. The first two parts of this 3-tuple form the data descriptor.
The forum agreed to define a message buffer as an array of 3-tuples. This array, along with its size, will replace the singleton tuple which is currently used for buffer specification. Each tuple describes a contiguous region of memory. System-allocated tuples must be pairwise disjoint for the same message buffer, but user-allocated tuples can overlap. The data descriptor parts of all tuples are still required to match between channel end points. Steve graciously agreed to write this proposal.

9. The topic of requiring data descriptors to be the same for channel end points (topic 22) was briefly discussed. It was agreed that dataspecs must match but that we should relax this requirement for the size. [This discussion was held independently of the previous one. It was agreed that we will consider integration of the proposals only after each one of them is approved independently.] This flexibility allows users to use the same buffer in multiple channels instead of creating multiple buffers over the same memory, one for each channel. This is part of the early binding specification, hence the implementation can still set up its channel transfers of the appropriate size during commit. Details of allocated size and transmitted size will be fleshed out in the proposal. This will not affect the buffer specification but will allow the bufiters for the channel end points to specify different lengths for message transmission. Shane will prepare this proposal.

10. The issue of SLEEP, WAKE-UP, WAIT-ON (topic 15) was reintroduced. The old problem is what the "main" application should do while the application is running as 0-sided communication with handlers scheduling the computation. The old idea of IDLE was reintroduced. The need for blocking on something was reiterated. Waiting on nothing (IDLE), for a time (absolute or relative time_spec), and on an event were considered.
It was agreed that a single operation, MPIRT_RECEPTOR_WAIT (receptor, time_spec), will support all three cases. This operation will be blocking and late binding, but the "special type of a receptor" will be registered (somehow) at commit time. Time_spec allows setting up a timeout so that the operation will wait on both time and event. If both are specified as NULL then this is equivalent to IDLE. Dennis volunteered to write a proposal.

11. The issue of mutable buffers (topic 43) was considered for the first time. [Again the forum decided to consider this in isolation for now and not as part of bullets 8, 9 or 12.] It was decided that the buffer create operation will have a flag which indicates that the address is mutable at run time. The default value of the flag will be set to not mutable to allow backward compatibility. [A more generic version will be considered when we merge proposals for this topic with the next one (topic 44).] The effect of user allocated vs. system allocated memory, as well as bufiter policies that match buffers for channel end points, will be considered as part of this proposal. Steve agreed to write this proposal.

12. The related topic (topic 44) of flexible buffer size was also discussed for the first time. It was agreed that we will need two sizes: maximum possible buffer size and current buffer size. The interpretation of the maximum size for the sending vs. receiving end point of the channel was discussed. The effect of the push vs. pull model led to the following agreement: the buffer size must be greater than the message transmission size specified by the bufiter (see bullet 9). Backward compatibility needs to be addressed, as well as the merger with the previous topic. Steve agreed to write this proposal also.

13. The issue of multiple running versions of MPI/RT was brought up (topic 46). The idea of having a "unique" ID for each running version/session was considered. The sticky issue is the uniqueness of the ID across reboots.
The use of time_spec along with platform specific and MPI/RT specific identifiers was considered. There was no requirement to get a quick solution. Rather, there is a need to consider this ID for debugging purposes, fault identification, and instrumentation in the future. Consensus was forming to treat it initially as part of the instrumentation, as an extension. Leonard and Arkady agreed to work on this proposal.

14. The issue of the channel state transition diagram was revisited. While the desire was to consider the topic of waiting on an activated channel, it quickly became apparent that the current figure in MPI/RT-1.0 needs some work. The meaning of the labels on the transitions, as well as the relationship of condition raising to the states and transitions, needs to be clarified. It was agreed that while we will preserve the states that are visible to the users, we will introduce some internal sub-states to help in understanding the relationship between condition raisings and the states. The full description of the figure will be added to the text. The figure does not represent the requirements but is just a tool to help visualize the states, transitions, and events that a channel can possibly generate. Arkady will update the document ASAP and distribute it for further comments.

15. The next meeting of the forum will take place at MITRE in Bedford (Boston) on April 7-8. It will last 2 days (maybe a day and a half). Arkady will host it.

16. The meeting adjourned at 10:30am on Friday, Feb. 5.

***************************************************************************************

The combined list of topics with their current schedules is stated below.
    TOPIC                                         | Expected   |            |
    and person(s) in charge                       | Completion | Discussion |
-----------------------------------------------------------------------------
 1  Data-reorganization support                   | M          | E          |
    Tony, James (from DATA REORG EFFORT)          |            |            |
                                                  |            |            |
 2  Collective operations (ALL_TO_ALL)            | E          | E          |
    Games, Arkady                                 |            |            |
                                                  |            |            |
 3  Dynamic Process Management,                   | E-M        | E          |
    Parallel Client Server, etc.                  |            |            |
    Tony                                          |            |            |
                                                  |            |            |
 4  Interoperation of MPI/RT Implementations      | L          | E          |
    Tony                                          |            |            |
                                                  |            |            |
 5  Support for Java binding of MPI/RT            | L          | M          |
    (other languages too?)                        |            |            |
    Tony                                          |            |            |
                                                  |            |            |
 6  Full-QOS-based MPI/RT with mode changes       | M          | 2/3/99     |
    (such as QOS capability for mode changes)     |            |            |
    Arkady, Steve                                 |            |            |
                                                  |            |            |
 7  Explicit ability to work with real-time       | L          | M          |
    schedulers for process scheduling             |            |            |
    Manoj                                         |            |            |
                                                  |            |            |
 8  Work to support k-slack channel mode          | M          | 2/3/99     |
    (whereas MPI/RT 1.0 is 1-slack)               |            |            |
    Shane, Tony                                   |            |            |
                                                  |            |            |
 9  Support for IDL for MPI/RT                    | L          | L          |
    Tony                                          |            |            |
                                                  |            |            |
10  Fine grain, Coarse grain QOS specifications   | M          | M          |
    Arkady, Leonard                               |            |            |
                                                  |            |            |
11  More powerful data descriptor to allow        | M          | 2/3/99     |
    strided and non contiguous buffers            |            |            |
    James, Tony                                   |            |            |
                                                  |            |            |
12  how to handle other resources' schedulers.    | L          | M          |
    Currently we have dealt with memory and       |            |            |
    network in two different ways and via         |            |            |
    completely different specs...                 |            |            |
    (related to 7 and 10)                         |            |            |
    Arkady, Leonard, Manoj                        |            |            |
                                                  |            |            |
13  Operations on containers. The old issues      | E          | E          |
    of propagating the operations to every object |            |            |
    in a container (probably for the containers   |            |            |
    that contain homogeneous objects only).       |            |            |
    Nathan, Dennis                                |            |            |
                                                  |            |            |
14  Container Iterators                           | E          | E          |
    Nathan, Dennis                                |            |            |
                                                  |            |            |
15  MPIRT_SLEEP, MPIRT_WAIT, MPIRT_WAKEUP         | E-M        | 2/3/99     |
    Dennis, Arkady, Tony                          |            |            |
                                                  |            |            |
16  Channel state transition (old busy state)     | E-M        | E          |
    Cui, Arkady                                   |            |            |
                                                  |            |            |
17  MPIRT_CHANNEL_WAIT, MPIRT_CHANNEL_TEST        | E-M        | E          |
    (related to 16)                               |            |            |
    Arkady, Cui                                   |            |            |
                                                  |            |            |
18  QoS Spec for channel (remove restrictions     | E-M        | E          |
    that were made for MPI/RT-1.0 at the last     |            |            |
    meeting)                                      |            |            |
    Tony, Leonard, Manoj                          |            |            |
                                                  |            |            |
19  Default Constructors                          | E          | E          |
    Cui, Nathan, Shane                            |            |            |
                                                  |            |            |
20  New Instrumentation metrics                   | E-L        | E-L        |
    (Add as needed with experience)               |            |            |
                                                  |            |            |
21  More error returns                            | E-L        | E-L        |
    Arkady, Tony                                  |            |            |
                                                  |            |            |
22  Remove restriction of homogeneity on          | E-M        | 2/4/99     |
    bufiter parameters:                           |            |            |
    dataspec and bufsize for channel endpoints    |            |            |
    Shane, Arkady                                 |            |            |
                                                  |            |            |
23  Unreliable Channels                           | M-L        | 2/4/99     |
    Leonard, Tony                                 |            |            |
                                                  |            |            |
24  Fault Tolerance, Fault Handling, Fault        | L          | M          |
    Recovery                                      |            |            |
                                                  |            |            |
25  Light-Weight Handlers                         | E          | E          |
    Tony                                          |            |            |
                                                  |            |            |
36  Half Channels (I/O)                           | M          | 2/3/99     |
    Shane, Tony                                   |            |            |
                                                  |            |            |
37  Default Handlers                              | E          | E          |
    Tony                                          |            |            |
                                                  |            |            |
38  Group parameter scope for the commit          | M          | E          |
    Tony, Arkady                                  |            |            |
                                                  |            |            |
39  Mechanisms for passing in externally built    | M-L        | M          |
    schedules to MPI/RT                           |            |            |
    Arkady, Tony                                  |            |            |
                                                  |            |            |
40  Logical grouping of event kinds               | L          | M          |
    (arithmetic on events)                        |            |            |
    Arkady, Leonard                               |            |            |
                                                  |            |            |
41  PGCOs                                         | M-L        | M          |
    Arkady                                        |            |            |
                                                  |            |            |
42  QoS for handlers (add deadlines)              | M          | E          |
    Arkady, Steve                                 |            |            |
                                                  |            |            |
43  Mutable buffers (late binding of address)     | E          | 2/4/99     |
    Steve, Robert Ginn                            |            |            |
                                                  |            |            |
44  Variable length buffers                       | E          | 2/4/99     |
    Steve, Robert Ginn                            |            |            |
                                                  |            |            |
45  Periodic triggers                             | M          | E          |
    Steve, Leonard, Arkady                        |            |            |
                                                  |            |            |
46  Unique ID for running MPI/RT version          | M-L        | 2/4/99     |
    Leonard, Arkady                               |            |            |
                                                  |            |            |
47  QoS for QoS (granularity for raising errors)  | L          | 2/5/99     |
    Leonard                                       |            |            |