20040317icumeeting

10:02am 3/17/2004 - ICU meeting. Action items are marked with @@

Ram- suppressing draft APIs

* will try to get a compilation error for C APIs only
* will have a switch to #ifdef out all C++ APIs
* #define U_DRAFT (not U_CAPI), substitute with static - to cause a compilation error
* won't work for Apple - need control over access to the draft APIs on a per-client basis
* but - user will be able to set (based on #defines) whether this is in effect or not.
* we are also trying to handle enums and #defines - which we think we can get around by #defining them to 'garbage' or something not defined in the library.
U_DRAFT will either be regular export declaration or static. Will be a header like urename.h
1. will have a compilation error
Q: [apple] will some compilers complain?
1. Ram will investigate to make sure we won't have any trouble on other platforms.
2. We looked at other approaches that leave the prototype undefined, but that would only produce a warning. A fallback would be to redefine the API to an undefined symbol, which gives a link error, e.g. ures_open #defined to ures_open_DRAFT.
3. We do not want to put #ifdefs around each function. Also, U_DRAFT provides documentation right there.
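A minimal sketch of the U_DRAFT idea discussed above, for illustration only. U_DRAFT is defined locally here so the example stands alone (in ICU it comes from the library headers), and DEMO_HIDE_DRAFT_API, DEMO_DRAFT_LIMIT and the demo_* functions are hypothetical names, not ICU APIs.

```c
#include <assert.h>

/* Hypothetical per-client switch (not an ICU name): define it before
 * including the headers to suppress draft C APIs for this client. */
#ifdef DEMO_HIDE_DRAFT_API
#   define U_DRAFT static   /* prototype gets internal linkage */
#else
#   define U_DRAFT extern   /* stand-in for the regular export declaration */
#endif

/* Draft constants can be #defined to a name that exists nowhere, so any
 * use of them fails to compile when draft APIs are hidden. */
#ifdef DEMO_HIDE_DRAFT_API
#   define DEMO_DRAFT_LIMIT demo_draft_limit_is_not_available
#else
#   define DEMO_DRAFT_LIMIT 42
#endif

/* Header part: hypothetical declarations. */
extern int demo_stableFunction(int x);   /* stable API */
U_DRAFT int demo_draftFunction(int x);   /* draft API; the marker documents it */

/* Library part: in a real build these bodies live in the library, so a
 * client that hides draft APIs sees only a static prototype with no body. */
int demo_stableFunction(int x) { return x + 1; }
int demo_draftFunction(int x) { return x * 2; }

/* Client part: compiles as-is because DEMO_HIDE_DRAFT_API is not defined. */
void demo_draftUsage(void) {
    assert(demo_stableFunction(2) == 3);
    assert(demo_draftFunction(2) == 4);
    assert(DEMO_DRAFT_LIMIT == 42);
}
```

Calling a static function that is declared but never defined requires a diagnostic in C, but some compilers report it as a warning followed by a link failure rather than a hard error. That is the cross-platform question raised above.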
C++
1. Disabling can be accomplished by not publishing the header files. (hdrtest has the list)
2. We don't have a good idea how to suppress individual C++ methods which are draft.
3. Q: [Apple] How to produce the list of C++ files?
  1. we could have a make target which removes the C++ files
  2. test/hdrtest/cxxfiles.txt contains a list of files, with a # comment. This could work for Apple. Apple would like it if dead C++ files were purged from cxxfiles.txt. Deborah already filed a bug on a better mechanism.

String Handling - Deborah requested different string handling in our APIs

Description - Deborah
1. Apple String handling
  1. uses ASCII internally sometimes for efficiency
  2. linear array is inefficient - Apple uses trees to avoid moving (say) 1g of text around if an editor is doing an insert
2. Problems
  1. some APIs, such as regex, require linear array to work
  2. some APIs support abstraction such as character iterator and replaceable, but they are char at a time. Concerned about slowness using string search, etc. Concerned that two function calls per character could be a problem
3. Request
  1. Make an abstraction for text access universal, esp. across regex
  2. Add a chunky character access, where you can get a chunk.
  3. Critical areas where there must not be a non-inline function call per character.
ICU Issues [Markus]
1. besides API, would have to rewrite internals such as search engine, for how the internals work.
2. Q: would an API like incremental conversion work? A: No, Regex requires random access.
3. Shouldn't be a unique situation to Apple. Would be nice to have regex be able to handle UTF8, say.
could hack something up for regex
1. Could have an iterator with two functions - a char, and a chunk
2. would need to do some mapping between internal char and the original char (say for utf8)
3. Apple doesn't attempt to make this work for DBCS or MBCS.
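A rough sketch of the "a char, and a chunk" idea, assuming text stored as non-contiguous segments (as a tree-based editor might hold it). All names (DemoSegment, DemoChunkedText, demo_chunkAt, demo_charAt, demo_countUnit) are hypothetical and are not part of ICU. The per-character function is built on the chunk function, and a consumer that walks whole chunks makes one call per chunk instead of one per character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of UTF-16 code units (e.g. a leaf of a text tree). */
typedef struct DemoSegment {
    const uint16_t *units;
    int32_t length;
} DemoSegment;

/* Text made of several segments that are not adjacent in memory. */
typedef struct DemoChunkedText {
    const DemoSegment *segments;
    int32_t segmentCount;
} DemoChunkedText;

/* Chunk access: returns the contiguous run that contains `index` and
 * reports where the run starts and how long it is; NULL if out of range. */
const uint16_t *demo_chunkAt(const DemoChunkedText *text, int32_t index,
                             int32_t *chunkStart, int32_t *chunkLength) {
    int32_t start = 0;
    if (index < 0) {
        return NULL;
    }
    for (int32_t i = 0; i < text->segmentCount; ++i) {
        int32_t len = text->segments[i].length;
        if (index < start + len) {
            *chunkStart = start;
            *chunkLength = len;
            return text->segments[i].units;
        }
        start += len;
    }
    return NULL;
}

/* Per-character access, implemented on top of the chunk function. */
int32_t demo_charAt(const DemoChunkedText *text, int32_t index) {
    int32_t start, len;
    const uint16_t *chunk = demo_chunkAt(text, index, &start, &len);
    return chunk != NULL ? chunk[index - start] : -1;
}

/* Consumer that scans by chunk: the inner loop has no function call. */
int32_t demo_countUnit(const DemoChunkedText *text, uint16_t unit) {
    int32_t count = 0, index = 0, start, len;
    const uint16_t *chunk;
    while ((chunk = demo_chunkAt(text, index, &start, &len)) != NULL) {
        for (int32_t i = index - start; i < len; ++i) {
            if (chunk[i] == unit) {
                ++count;
            }
        }
        index = start + len;
    }
    return count;
}

void demo_chunkUsage(void) {
    static const uint16_t first[] = { 'a', 'b', 'c' };
    static const uint16_t second[] = { 'b', 'd' };
    const DemoSegment segments[] = { { first, 3 }, { second, 2 } };
    const DemoChunkedText text = { segments, 2 };
    assert(demo_charAt(&text, 3) == 'b');      /* first unit of 2nd segment */
    assert(demo_countUnit(&text, 'b') == 2);
}
```

The sketch stays in UTF-16 code units throughout. The index mapping needed for UTF-8 sources (item 2 above) is not shown.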
Would it make sense to make an API that works well for Apple on the interface they want and see how well it works using Replaceable?
1. Apple: Regex has no comparison (to an existing Unicode regex) currently, but string search is going to be a problem.
2. Apple should be able to update the character iterator's API so that a simple search and replace in the callers switches them to the chunked API.
Q: We thought that Apple was concerned with C++ apis, is this now a C++ requirement?
1. we already have a wrapper for C regex, which calls C++ - concerned about how that implementation does character access.
could
1. ... add function pointers to ucharacteriterator which would do fastnext() and fastprev(), which would by default call the slow per-char function. Can accomplish this by setting the buffer size to 0 so that the per-char function is always called. This would give us the opportunity to migrate the implementations where it makes sense (some services could be migrated to use the new functions).
2. fastnext [inline] would just call through to next() currently.
3. Frank: is there an issue of needing to access more than 16 bits of a character? Should the API return 32 bit?
  1. A: no, services detect this situation and fetch the trail surrogate already. UCharIterator is purposely simple.
4. @@ Deborah will write up a proposal. Will include the option of client-managed buffering.
  1. Andy: Regex needs two new functions. Regex is all random access due to backtracking.
    1. utf32 at index,
    2. and return a buffer of utf16s given an index.
5. Also, may be different needs between string search and regex, since string search already uses iterators (collationelementiterator - due to collation complexity.)
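A minimal sketch of the fastnext() idea from item 1, assuming a simplified iterator struct. DemoIter and every demo_* name are hypothetical; this is not the real UCharIterator layout. With a buffer length of 0 the inline function always falls through to the per-character function, so existing implementations keep working unchanged, while a migrated implementation can expose a buffer and skip the call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoIter DemoIter;
struct DemoIter {
    const uint16_t *buffer;               /* current chunk, may be NULL */
    int32_t bufferLength;                 /* 0 means: always use slowNext */
    int32_t bufferPos;
    int32_t (*slowNext)(DemoIter *iter);  /* per-char function, -1 at end */
    void *context;
};

/* Inline fast path: read from the buffer while it lasts, otherwise call
 * through to the slow per-character function. */
static inline int32_t demo_fastNext(DemoIter *iter) {
    if (iter->bufferPos < iter->bufferLength) {
        return iter->buffer[iter->bufferPos++];
    }
    return iter->slowNext(iter);
}

/* Unmigrated implementation: one function call per code unit. */
typedef struct DemoLegacyState {
    const uint16_t *text;
    int32_t length;
    int32_t pos;
    int32_t calls;                        /* counts slow calls for the demo */
} DemoLegacyState;

int32_t demo_legacyNext(DemoIter *iter) {
    DemoLegacyState *state = (DemoLegacyState *)iter->context;
    ++state->calls;
    return state->pos < state->length ? state->text[state->pos++] : -1;
}

/* Migrated implementation whose whole text is in the buffer: the slow
 * function is only reached at the end of the text. */
int32_t demo_endOfText(DemoIter *iter) {
    (void)iter;
    return -1;
}

/* A service written once against demo_fastNext works with both. */
int32_t demo_sumAll(DemoIter *iter) {
    int32_t c, sum = 0;
    while ((c = demo_fastNext(iter)) >= 0) {
        sum += c;
    }
    return sum;
}

void demo_fastNextUsage(void) {
    static const uint16_t text[] = { 1, 2, 3 };
    DemoLegacyState legacy = { text, 3, 0, 0 };
    DemoIter slow = { NULL, 0, 0, demo_legacyNext, &legacy };
    DemoIter fast = { text, 3, 0, demo_endOfText, NULL };
    assert(demo_sumAll(&slow) == 6);
    assert(legacy.calls == 4);   /* three code units plus the end-of-text call */
    assert(demo_sumAll(&fast) == 6);
}
```

The sketch returns 16-bit code units, in line with the answer to Frank's question. A fastprev() counterpart and client-managed buffer refills are left out.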
IBM: Scheduling issue for 3.0 on regex. Very slim chance that it would make it. No room for new features, unless we have some help.

Character properties access speed

Problem [Apple]
1. the way properties lookup is done is too slow
2. Apple did some testing
3. Two forms of lookup
  1. internal form uses internal property data if already loaded - Bidi, canonical ordering, and internal-element class (modified GC)
  2. Internal was 2-3x faster than ICU
4. Apple uses separate 2-stage lookups per property - 7/9 - a lot of hits for the first 128 char range.
5. ICU uses a single trie - people already complain about the size of the property.
6. Perhaps less-frequent data could be moved to a separate trie?
7. Also, Longer runs of characters with separate tries per property - don't need 2ndary and tertiary tables - 5k-8k per property.
8. Markus optimized the trie -
  1. which is how we have the 11/5 shift.
  2. also, we have another indirection - a 16 bit index into a 32 bit (?) word. Would it be faster and smaller if the 32 bit word were split into two chunks?
9. @@ ICU - Could have an API that gives access to most of the 32 bit word in one call - general property, mirrored, and others. Deborah to file jitterbug.
  1. Some of the bits go to other internal tries - those bits could be set to zero.
10. Apple's critical: GC, CC, and Bidi category.
  1. Could get GC and Bidi - CC is in a different trie.
  2. Q: Which properties are being used by ICU users in general? A: it's hard enough for ICU to know which companies are using ICU, much less properties!
  3. Q: is this for rendering? A: yes, the rendering is the most critical, have to get several properties (bidi, etc).
    1. Could do a version as a test, which calls ICU for rendering, to the engineers and have them try it.
  4. Q: There's a function call that checks if data is already loaded (a static variable). Could find out how much overhead that adds.
  5. Apple has a lot of internal functions. Wants to migrate to ICU for these as long as (a) features and (b) performance are acceptable for adoption.
  6. Q: What happens with the Apple 9/7 shift, with supplementary characters? A: supplements are handled as exceptions. Apple has other functions which look up a full UTF32 char, with a ?/?/5 bit shift.
  7. Suggestion - Apple could cache the lookups from the existing ICU api, into a new data structure.
    1. Don't have to update the data file, but has fast access.
    2. ICU has lots of constraints on size. Have to work carefully if we would try to increase the data size.
    3. Footprint is an issue for Apple also - paging etc.
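A rough sketch of the caching suggestion in item 7, combined with a two-stage lookup of the kind described in item 4. It assumes that the "9/7" split means a 9-bit first-stage index and a 7-bit offset into 128-entry blocks, which is an interpretation of the notes. demo_slowProperty is a hypothetical stand-in for a per-character ICU property call such as u_charType, and all other names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    DEMO_BLOCK_SHIFT = 7,                               /* low 7 bits: offset */
    DEMO_BLOCK_SIZE = 1 << DEMO_BLOCK_SHIFT,            /* 128 per block */
    DEMO_STAGE1_SIZE = 0x10000 >> DEMO_BLOCK_SHIFT      /* 512 BMP blocks */
};

typedef struct DemoPropTable {
    uint16_t stage1[DEMO_STAGE1_SIZE];                  /* block number */
    uint8_t stage2[DEMO_STAGE1_SIZE][DEMO_BLOCK_SIZE];  /* worst-case size */
    int32_t blockCount;                                 /* blocks in use */
} DemoPropTable;

/* Hypothetical stand-in for the existing (slower) per-character API. */
uint8_t demo_slowProperty(uint32_t c) {
    if (c >= 'A' && c <= 'Z') return 1;
    if (c >= 'a' && c <= 'z') return 2;
    if (c >= '0' && c <= '9') return 3;
    return 0;
}

/* Fill the table once by calling the slow API for every BMP code point.
 * Identical blocks are shared, which keeps long uniform runs small. */
void demo_buildTable(DemoPropTable *table) {
    uint8_t block[DEMO_BLOCK_SIZE];
    table->blockCount = 0;
    for (int32_t i = 0; i < DEMO_STAGE1_SIZE; ++i) {
        int32_t b;
        for (int32_t j = 0; j < DEMO_BLOCK_SIZE; ++j) {
            uint32_t c = ((uint32_t)i << DEMO_BLOCK_SHIFT) | (uint32_t)j;
            block[j] = demo_slowProperty(c);
        }
        for (b = 0; b < table->blockCount; ++b) {
            if (memcmp(table->stage2[b], block, DEMO_BLOCK_SIZE) == 0) {
                break;                                  /* reuse this block */
            }
        }
        if (b == table->blockCount) {
            memcpy(table->stage2[table->blockCount++], block, DEMO_BLOCK_SIZE);
        }
        table->stage1[i] = (uint16_t)b;
    }
}

/* Fast path: two array reads for BMP code points. Supplementary code
 * points are treated as exceptions and go back to the slow API. */
static inline uint8_t demo_fastProperty(const DemoPropTable *table, uint32_t c) {
    if (c > 0xFFFF) {
        return demo_slowProperty(c);
    }
    return table->stage2[table->stage1[c >> DEMO_BLOCK_SHIFT]]
                        [c & (DEMO_BLOCK_SIZE - 1)];
}

void demo_tableUsage(void) {
    static DemoPropTable table;           /* static: about 65 KB worst case */
    demo_buildTable(&table);
    assert(demo_fastProperty(&table, 'Q') == 1);
    assert(demo_fastProperty(&table, 0x4E00) == demo_slowProperty(0x4E00));
    assert(table.blockCount == 2);        /* ASCII block plus one all-zero block */
}
```

The data file is untouched in this approach. The cost is the extra memory for the cache, which is the footprint concern raised in items 7.2 and 7.3.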

BreakIterator - George

Introduction
1. Also relevant to future structure of resource bundles (trees)
2. For 3.0, will split off collation, rbnf, transliterator data
3. Q [Apple]: Will this affect locality? I.e. will the data for a given locale be separated and end up on different pages for memory mapping? A: We could sort by locale so that all the pages of say 'fr.res' and 'collation/fr.res' are near each other.