Minutes of the MPI/RT Forum meeting held June 8-10, 1999 at LM-GES in Moorestown, NJ.

Attendees:
Arkady Kanevsky (co-chair), Mercury Computers, arkady@mc.com
Dennis Cottel, SPAWAR (SYSCEN, San Diego), dennis@spawar.navy.mil
Randall Judd, SPAWAR (SYSCEN, San Diego), judd@spawar.navy.mil
Nathan Doss, LM/GES, nathan.e.doss@lmco.com
Tom McClean, Lockheed Martin (LM/GES), Thomas.P.Mcclean@lmco.com
Shane Hebert, MSTI, shane@mpi-softtech.com
Michael Grieco, ASEC/NSA, mjgrieco@jswg.org
Steve Paavola, SKY Computers, paavola@sky.com
Anthony Skjellum (co-chair), MSTI, tony@mpi-softtech.com
James Lebak (minutes taker), MIT/LL, jlebak@ll.mit.edu
Yogi Dandass, MSU, yogi@erc.msstate.edu

SUMMARY

There was 1 official vote on a clarification to the MPI/RT-1.0 document.
A. MPIRT_INIT is a required synchronization point for all processes. Formal vote passed 7/0/2.

There were several straw votes on the MPI/RT-1.0 document.
1. The names of the states should include the name of the object. This means that instead of "ACTIVATED" it should be "MPIRT_CHANNEL_ACTIVATED". See issue 9. Vote passed 7/0/2.
2. The changes for issue 21: NULL objects for all types; a new operation to find out whether an object is a NULL object of any type; a dup of a NULL object refers to the same object. Vote 7/2/2.
3. The tables of all possible return codes for each function should not be done for the MPI/RT-1.0 document, but should be done in the MPI/RT-1.1 document. Vote 8/0/0.
4. No clarification is needed for best effort quality of service for triggers. Vote 9/0/2.
5. Channel activation is not allowed if the in-bufiters for all channel endpoints are shared with other channels. Vote 6/0/5.
6. The changes for issue 27: clarification of Commit for the Best Effort QoS specification. Vote 8/0/3.

There were several formal votes on MPI/RT-1.1 proposals.
1. The Sleep proposal, with changes, passed the official vote 7/0/1.
2. The Receptor Wait proposal, with changes, passed the official vote 8/0/0.

I. The meeting came to order at 8:30am on Tuesday, June 8.

II.
The minutes of the April meeting were approved 8/0/1. The names of the three proposals for relaxing data descriptor requirements will be added to the minutes. Change Morristown to Moorestown, NJ.

III. During most of the meeting, changes to the MPI/RT-1.0 document were discussed. The following list of 46 change requests accumulated since the March meeting was presented. Most of the items are already done. The remaining items and the changes agreed at this meeting will be completed before the next meeting.

1. Go through the entire document and use the \state and \cond macros for the states and conditions of all objects.
Done.
2. In the introduction chapter (chapter 1), write sections 1.3.4 (Erroneous programs) and 1.3.5 (Implementation libraries).
Done.
3. Remove "draft" from the cover page (last one to do!).
Remains open; this will be the last item to be done.
4. C++ binding changes.
Done. See issue 19 for more C++ binding changes approved by the meeting.
5. Check C bindings (especially for Containers).
Open. Add to the object chapter that, for MPIRT_CONTAINER, users of the C binding need to typecast to the specific container type (CSET, CVECTOR, GROUP). No typecasting is needed for C++.
6. Check how we use "transmittable" in the glossary, intro, and dataspec chapters.
Open.
7. Add to the preface "how to read the document" (what advice means, naming conventions, and so on).
Done.
8. Resize the figures.
Done. Figure 2.1 on page 19 is too large. Reword the citations after figures 8.1 and 8.2.
9. Decide whether we should add the name of the object before the state (in other words, instead of ACTIVATED it will be MPIRT_CHANNEL_ACTIVATED). If the decision is positive, make a global change for all state names.
The straw vote passed 7/0/2 for the longer names. The current document (June 7 '99 version on the web) already reflects this change.
10. Add examples to the final document.
Everybody agreed that we need examples.
There was a discussion on whether to add them to the MPI/RT-1.0 document or the MPI/RT-1.1 document, or to put them on the mpirt web page or in a separate document. It was pointed out that examples are not part of the document and are not subject to the vote. In the end it was agreed that the examples will not be part of the MPI/RT-1.0 document but will be part of the MPI/RT-1.1 document and will also be placed on the web. All the vendors should check whether they can "donate" examples. All examples placed on the web and in the document should be tested on the vendor's MPI/RT implementation. The MPI/RT editors will try to ensure that they also run on all existing MPI/RT implementations (Mercury, Sky, & CSPI).
11. Define an MPIRT_ERR_OBJECT_NOT_COMMITTED error, which will be returned by a committable object for operations that require the object to be committed. Add this error for all appropriate functions.
The error is defined in the latest draft. The error description needs some rewording.
12. For each function, define a table with all possible status return values and the conditions under which each value is returned. (Maybe just add these tables at the end of each chapter rather than in each function description.) Synchronize with Annex A and the first (status returns) index.
It was agreed that this should not be done in 1.0 but should be done in the MPI/RT-1.1 document. A straw vote was taken and passed 8/0/0.
13. Define a generic error, which will be returned in all cases where a specific error name is not defined. Document all these places and generate a list of suitable error names for all these cases.
It was decided not to add a generic error. The existing error MPIRT_ERR_INV_OBJECT will be changed to MPIRT_ERR_INV_ARGUMENT to handle a generic invalid parameter to an MPIRT operation.
14. Clarify what data is in the out-bufiter if the transfer is not successful. Clarify how an implementation is allowed to abort the transfer.
Data in the "last" buffer is undefined if the error is returned.
The buffer is still moved from the input to the output buffer iterator in the case of an error. If the QoS is pure priority or best effort, MPIRT_ERR_TRANSMISSION is returned. The related condition is defined for the channel object so that users can define a receptor and handlers to deal with it. For all other QoS specifications (deadline, timeout, or stop event) the existing QoS error should be returned. For MPI/RT-1.1, investigate what can be done with the transmission error status. Remove "Abort" (for channel transmission) from the document.
15. Can (committable) objects be decorated at any time? After Commit?
Yes, objects can be decorated at any time. Add a note to the text on that.
16. Elaborate on scatter_channel (as is done for gather_channel).
Done. Add a sentence that the receive buffers for all channel endpoints are the same. Add an analogous sentence for the gather channel. Add the figures for the definitions of collective channels from the MPIDC MPIRT tutorial.
17. Define the missing CVECTOR_NULL in the text.
Done.
18. All operations should return int. Fix the returns for operations in 3.2.3.
Done.
19. Dup should be defined for all objects, not just leaves of the object hierarchy (implications for bindings).
The discussion of this issue quickly centered on what to do with operations that take non-leaf parameters. It was agreed that the current DUP definitions are wrong. The current DUP allows both copy constructor and clone. For leaf classes the DUP operation will be defined explicitly and no type casting is needed. Add advice to users that, for some operations on virtual classes, type casting may be needed for some language bindings. For functions that take non-leaf type parameters (CSET_INSERT, CVECTOR_REPLACE, DUP, FREE), type casting is needed. The copy constructor should be described separately from DUP. Finally, the following solution was agreed upon:
a. A default constructor, which initializes to object_NULL (of the appropriate type). [int = obj1.create(.) - user.
int MPIRT::object::create(.);]
b. Create fills it (C++ returns an error code, just like C).
c. Dup is not a copy constructor (C++ returns an error code, just like C). [int = obj1.dup(obj2) - user use; int MPIRT::obj1::DUP(obj2) - C++ binding] Dup is defined for MPIRT_OBJECT. All other objects inherit it. Type casting is needed even for a leaf. Users can either call dup on the leaf object type directly or use the dup of the virtual class with type casting. No ==.
d. An explicit free call, separate from the destructor. Use the same explanation as for constructors. [Rationale: no exception throwing; an error code is returned instead. Exception handling is too expensive.]
For C++, add Free before the destructor. Add a separate section for C++ constructors and destructors.
On the second day it was pointed out that C++ provides the copy constructor and operator = automatically. That means that an MPI/RT object is a handle to the actual object, that = and the copy constructor do a byte-by-byte copy of an object handle, and that the new object is IS_EQUAL to the original object as defined by the MPIRT_OBJECT_IS_EQUAL operation.
20. Define MPIRT_OBJECT_NULL (static member of the base class).
Done.
21. What are the NULL objects for non-leaf classes? Do we have them? Do we need them?
Yes, we need NULL for all objects. Add a new function MPIRT_OBJECT_IS_NULL(obj, flag), which sets flag to MPIRT_TRUE if the object is a NULL object of any type. The MPIRT__NULL constant has object type. NULL objects have the empty string "" as their object name. A duplicate of a NULL object of any type refers to the same NULL object. This means that MPIRT_OBJECT_IS_EQUAL applied to a copy of the NULL object and the NULL object will return MPIRT_IS_EQUAL. A straw vote was taken on these proposed changes, which passed 7/2/2.
22. What should the is_equal operation return? (Binding implications.) What does it return for a "comparison" of object_null, channel_null, pt_channel_null?
See the previous issue.
23. What is the meaning of the group WORLD if INIT is not a synchronization point?
MPIRT_INIT is now a required synchronization point for all processes. Formal vote passed 7/0/2.
24. Should we add an error MPIRT_ERR_TIMEOUT for commit to return when one of the group WORLD members was not created?
There was general consensus that a timeout for commit is useful and desirable. However, it was decided to postpone this until MPI/RT-1.1, where it is part of the QoS of commit issue and part of the RT mode change proposal.
25. Clarify which channel will transfer a buffer from a shared bufiter for BEST_EFFORT QOS.
This discussion was spread over two days. There are two real issues: what guarantees an implementation provides for best effort (and priorities), and which channel should transfer a buffer in the case of shared bufiters.
For the first question, agreement was reached that if resources are available then progress shall be made according to the quality of service. For priorities, this means that if resources are available, the highest priority channel among all channels that are ready to transfer a message will transfer. In the case of equal priority or best effort (the lowest priority), any channel can make progress if resources are available, no higher priority channels are ready, and no scheduled messages exist for a time-driven QoS spec. Replace "infinity" with a bounded deadline (which can be very large) in the definition of best effort on page 165. Clarify that if resources are not available then no progress will be made on a channel message transfer for best effort and priority QoS specifications. A lower priority channel can be delayed indefinitely if resources are not available. Remove the advice to users on page 165.
For the second question there were several sub-issues to clarify. First, a clarification will be made to the description of the START operation on page 149 to ensure that the buffer is removed from the input bufiter when the start returns.
The completion of start is instantaneous and does not mean that the transfer has started. The situation with ACTIVATED is more complicated. A small state transition diagram for the activated case was accepted. There are three states inside the current activated state: Ready, Xferable, and Xfer data. When a buffer is inserted into the in-bufiter of an activated channel, the channel transitions into Xferable. If some other channel (or the user) removes the buffer from the bufiter and the input bufiter becomes empty, the channel transitions back into the Ready state. From Xferable, when all other channel endpoints are also in Xferable, the channel endpoints all transition into Xfer data and transfer message(s) over the channel. At the completion of the transfer the channel transitions into the Ready state again. This clarification will be embedded into figure 9.1 on page 153.
A table was created, to be added to the document, that defines what happens for channel endpoints in the started and activated cases and in the shared and non-shared bufiter cases. All the cases are already defined except the case of shared bufiters and activated channel endpoints. For activated channel endpoints, if only one side (send or receive) has a shared in-bufiter, this is equivalent to the get and put models respectively. The case when channel endpoints are activated and their in-bufiters are shared is NOT allowed for MPI/RT-1.0. This ensures that there are no deadlocks. The last statement had a straw vote that passed 6/0/5.
26. Clarify the meaning of MPIRT_QOS_BEST_EFFORT for triggers.
No special clarification needed. Straw vote 9/0/2.
27. What is the requirement for MPIRT_QOS_BEST_EFFORT from COMMIT? Commit succeeds even though there is no "bandwidth" for that best-effort channel? There is a time at the end of a very long period?
No need for changes. Issue 25 already covers it. Straw vote passed 8/0/3.
28.
Clarify the requirements on a user for putting/getting buffers to/from bufiters for a barrier channel. Can bufiter_null be used for that type of channel?
Open.
29. Check the email comments that Anna sent to us.
a. Anna's glossary comments
b. Anna's comments on commit and data transfer
c. Anna's detailed list of comments
Done.
30. Fix the MPIRT_NULL problem in MPIRT_PROBE_CREATE.
Done.
31. Agree on the format for C++ for QOS and DATASPECS (should be a part of 4).
Done. There are various consistency problems with the indexes. Define in a footnote what a bold page number means (definition).
32. Add the underlying assumptions of reliability and ordering of messages (appropriate to the QoSs).
Done.
33. In 2.2.5, remove the statement that a group is "read-only". Remove all "read-only" in other places.
Done. Add that GROUP_WORLD, GROUP_SELF and GROUP_EMPTY are all constants and are read-only. All constants are read-only.
34. Add a definition of reliability to the Glossary.
Open.
35. Clarification of the MPIRT_CONTINUE return for the last synch handler.
Done.
36. Fix ...COND_QOS_FAILURE and ...COND_FAILURE as well as the analogous conditions for triggers and receptors. Check against the analogous conditions for channels.
Done.
37. Clarify what rembuffer is for CVECTOR_CREATE.
A "semiformal" vote by email agreed on MPIRT_BUFFER_NULL. Done.
38. Clarify that recoverable and retryable operations apply only to standard-defined operations and not to user-defined ones.
Done.
39. Fix MPIRT_Int64_assign32 from int to unsigned int.
Done.
40. Clarification and advice to users on how to create and manipulate groups.
Done. More editorial changes are needed. Remove the second half of the first paragraph of 2.5.2 (group manipulations are defined - inherited from the vector container). Change the rationale to regular text. Define "hole" as a NULL object. Add a note that a group can momentarily have duplicates but must have none at commit time.
41. Clarify the effect of changes (including free) to the original buffer on buffers that are derived from it using buffer partitioning operations.
The original "big" buffer must be committed with all its children. All the buffers over the same SYSTEM_MEM_ALLOC must be committed together (user requirement). Buffers over the same memory must be freed in the opposite order. Clarify that system allocated memory is allocated at commit time and freed at uncommit time (and commit if a buffer not present). The characteristics of the original buffer (the first to specify shared memory) cannot be changed if there are buffers derived from it (either by dup without a base change or by buffer partitioning). The memory linkage is always to the first buffer that is created over the same memory using SYSTEM_MEM_ALLOC. Dereferencing (changing the base address of) any of the intermediate buffers over the shared memory has no effect on its children, which still reference the original buffer memory. Add an error for the case where the system alignment for the dataspec is broken by a change of dataspec or base address.
42. Add a definition of the reduce operation (from MPI-1).
Open.
43. Clarify the receptor duplication operation, and the event name in particular.
Done. Copy the new text from section 4.2.2 (common EVDEL operations) to section 4.2.4 (receptors). Other small editorial changes.
44. Fix event name problems.
There is a problem with the uniqueness of event names. The event delivery abstraction scope for MPI/RT-1.0 is GROUP_WORLD. That means that the event name must be unique over GROUP_WORLD. The forum considered a proposal to add a group to the event name for implicitly generated events. For explicit events it is the user's responsibility to ensure the uniqueness of the name. Advice to the user should be added to clarify this. As part of this discussion, the decision was made to add a requirement that the name of the group object must be the same for all members (and users; see discussion below) of the group (processes of the group).
This will simplify the implementation's job of matching all the processes of the group to ensure that the group definition is correct. The old requirement of the same task addresses in the same order in the group (vector container) remains. A new error, MPIRT_ERR_GROUP_INCONSISTENT, will be added. This error covers both duplicate group names and the case where not all group processes created the right group.
The old specification for events implicitly generated by the channel object is a STRING_NAME of type "channel name":rank1:rank2:"condition name". This requires that the channel name be unique over GROUP_WORLD. The same channel name must also be unique over the group over which the channel is created. There was a discussion about changing the format to group:rank:object type:object name:condition for all implicitly generated events. After long discussion it was agreed that this is not needed for the MPI/RT-1.0 fix. All objects that can generate implicit events, except channels, are generated within one process and should have a unique name within that process. The rank is part of the event name for them, so we can think of them as scoped over the group GROUP_WORLD. Channels are scoped over the group over which they are defined. It is the user's responsibility to ensure that, if there are receptors for the channel events, the channel name is also unique over group world and hence unique in the process where the endpoint is defined. Thus, channel event names become analogous to the other object event names if "rank 2" is dropped. The receptor's process for channel events does not have to be a member of the group over which the channel is defined.
The following two resolutions were made: "rank 2" is not needed for channel event names. When defining an implicit receptor, the scope passed to the receptor needs to be the same as the group over which the object is defined, and the receptor process does not have to be a member of that group.
This means that if a user would like to have an implicit or an explicit receptor for an implicit event, the user needs to create the group over which the object generating the implicit event is defined (the group of the channel for channels, and MPIRT_GROUP_WORLD for the rest) and needs to pass that group as an input parameter to the explicit receptor or to the QoS specification for the implicit receptor.
45. Define object names for objects created by the implementation (e.g. GROUP_WORLD, GROUP_SELF).
Done.
46. Clarify that buffers in the initial buffer list are inserted into the buffer iterator at commit time.
Done.

V. At the next meeting the full review of the MPI/RT-1.0 document will take place, with the formal votes.

VI. The next Forum meeting will take place at Sky Computers on Sept. 8-10, W-F (half day Friday) (no Data Reorg). Data Reorg will take place at HPEC. The meeting after that will take place at Mercury Computers on Oct 19-22, Tu-Fr. The MPIRT meeting will be 2 days, Tu-Wed; Data Reorg 1 and a half days, Th & Fr. Friday is a half day. The final meeting of the year will be in San Diego, Dec. 14-17, Tu-Fr: MPI/RT 2 days, Data Reorg 1 and a half. Friday is a half day.

VII. The k-slack proposal was discussed.
There were several requests for clarification. First, what is the relationship between QoS and k-slack? Second, the current proposal draft is geared toward 0-sided communication. Please define what happens for the 1- and 2-sided models. Are multiple starts allowed with k-slack? What is the definition of WAIT for a k-slack channel? The same question applies to TEST, channel conditions, and implicit events. Add functionality so that a user can "wait" for a buffer associated with a particular message transfer (transaction). As part of this discussion the Forum asked for a clarification of MPIRT_CHANNEL_WAIT in the MPI/RT-1.0 document on page 151, lines 13-27.

VIII. Container Iterators proposal was discussed.
The container argument should be flat (no nested containers of any type) and should contain only objects of the type on which the operation can be performed. Two new functions, FLATTEN and FILTER, should be added to help with that. The order in which the operation is performed on the objects in the container is undefined, and an implementation is free to apply the operation to objects in parallel. New errors will be introduced for the cases where the input container is not flat or contains objects of the wrong type.

IX. The Sleep proposal was discussed.
What happens on mode changes? Sleep should return when Commit or Uncommit is called. The return code should specify why it returned. If the specified time is an absolute time that has already passed, an error should be returned. The formal vote on the proposal passed 7/0/1.

X. The Receptor wait proposal was discussed.
What happens on mode changes? Receptor_Wait should return when Commit or Uncommit is called, or when the receptor goes away. The return code should specify why it returned. The formal vote on the proposal passed 8/0/0.

XI. The testing suite for MPI/RT-1.0 was discussed.
The Forum should endorse it. For outreach, a new web site "next" to the mpirt web site should be created, with cross pointers. What should be done for the first spec? By December there is only time for a small subset. What functionality should it include? Initially in C, without eliminating C++. Initially it should not concentrate on what happens for erroneous programs. Meta standardization: run command, error formats, hidden flags, and access to implementation internals. Debug vs. production mode libraries.

XII. The proposal for relaxing data descriptor requirements was discussed.
Change "get" to "retrieve". Change MPIRT_INT to "int". It should represent the number of elements of the buffer dataspec. Change "length" to "count". The offset should be non-negative and is measured from the start of the buffer.
The current proposal is geared to ptchannel; specify what effect these changes have for collective channels. Define the relationship between the buffer count + offset and the bufiter count. All dataspecs should match. Specify what size buffers can be inserted in a bufiter (for variable length and offset) and which part of the buffer is transferred (send side, receive side). The count of the received buffer is modified by the channel to reflect the received message size.

XIII. Two new topics were added to the MPI/RT-1.1 list of topics:
1. Allow Init, Finalize, and a new Init.
2. A helper function for freeing all buffers over a specified system_mem_alloc.

XIV. There was a small group discussion while Data Reorg was in preparation.
There was a discussion of the old Compare operation and object equivalency. Users themselves can decide what equivalency means for each object type. Two objects for which this is hard to do are buffers and task_addresses. The following decisions were made to handle that:
1. For MPI/RT-1.1 we need an MPIRT_BUFFER_GET_MASTER (buffer, &master, &offset) operation, which returns the first buffer that requested the shared system_mem_alloc memory. The offset will be in bytes to handle the situation where dataspecs were changed in the tree of buffers created over the same memory. Hence, specifying the offset in dataspecs is not possible.
2. For MPI/RT-1.1 add an MPIRT_DATASPEC_SIZEOF(dataspec) operation to find the size of a dataspec in bytes.
3. For MPI/RT-1.0 remove the DUP and FREE operations for TASK_ADDRESS.
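
Decisions 1 and 2 above can be illustrated with a small C sketch of why the offset has to be reported in bytes. This is only a model under stated assumptions, not the MPI/RT API: the types mock_dataspec and mock_buffer and the functions mock_dataspec_sizeof, mock_buffer_get_master and example_offset are hypothetical names invented here, and the sketch assumes each derived buffer records its start as an element count in its parent's dataspec.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; these are NOT MPI/RT types. */
typedef struct {
    size_t elem_size;                  /* size of one dataspec element, in bytes */
} mock_dataspec;

typedef struct mock_buffer {
    const struct mock_buffer *parent;  /* NULL for the master buffer */
    size_t offset_elems;               /* start within parent, in parent elements */
    mock_dataspec spec;                /* this buffer's own dataspec */
} mock_buffer;

/* Models the proposed MPIRT_DATASPEC_SIZEOF: size of a dataspec in bytes. */
size_t mock_dataspec_sizeof(mock_dataspec spec)
{
    return spec.elem_size;
}

/* Models the proposed MPIRT_BUFFER_GET_MASTER: walk up to the first buffer
 * over the memory and accumulate the offset in bytes. Returns 0 on success,
 * -1 on a NULL argument (an assumed error convention). */
int mock_buffer_get_master(const mock_buffer *buf,
                           const mock_buffer **master,
                           size_t *offset_bytes)
{
    size_t total = 0;

    if (buf == NULL || master == NULL || offset_bytes == NULL)
        return -1;

    while (buf->parent != NULL) {
        /* Each step is measured in the parent's element size. */
        total += buf->offset_elems * mock_dataspec_sizeof(buf->parent->spec);
        buf = buf->parent;
    }
    *master = buf;
    *offset_bytes = total;
    return 0;
}

/* Builds master (8-byte elements) -> child (4-byte) -> grandchild (2-byte)
 * and returns the grandchild's byte offset from the start of the master. */
size_t example_offset(void)
{
    mock_buffer master_buf = { NULL, 0, { 8 } };
    mock_buffer child = { &master_buf, 4, { 4 } };   /* 4 * 8 = 32 bytes in */
    mock_buffer grandchild = { &child, 3, { 2 } };   /* 3 * 4 = 12 bytes further */
    const mock_buffer *found = NULL;
    size_t offset = 0;
    int rc = mock_buffer_get_master(&grandchild, &found, &offset);

    assert(rc == 0 && found == &master_buf);
    return offset;
}
```

In this model the grandchild starts 44 bytes into the master's memory. Since 44 is not a multiple of the master's 8-byte element size, the position cannot be expressed as a whole number of master dataspec elements, which is the situation the byte offset is meant to handle.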