Topic
- 10:02am 3/17/2004 - ICU meeting. Action items with @@
- Attendees
- IBM
- George, Markus, Doug, Ram, Alan, Vladimir, Steven (scribe), Andy
- AOL
- Frank Tang
- Apple
- Peter Edberg, Deborah Goldsmith
- Oracle
- Craig Cummings
- Ram - suppressing draft APIs
- … [reformat this]
* will try to get a compilation error for C APIs only
* will have a switch to #ifdef out all C++ APIs
* #define U_DRAFT (not U_CAPI), substitute with static - to cause a compilation error
* won't work for Apple - need control over access to the draft APIs on a per-client basis
* but - user will be able to set (based on #defines) whether this is in effect or not.
* we are also trying to handle enums and #defines - which we think we can get around by #defining them to 'garbage' or something not defined in the library.
* U_DRAFT will either be regular export declaration or static. Will be a header like urename.h
- will have a compilation error
- Q: [apple] will some compilers complain?
- ram will investigate to make sure we won't have any trouble on other platforms
- we looked at other approaches with undefined prototypes, but that would just produce a warning, and a fallback behavior that would result in the API being redefined to an undefined symbol (link error) - i.e. ures_open #defined to ures_open_DRAFT
- We do not want to put #ifdefs around each function.
- Also, U_DRAFT provides documentation right there.
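As a rough illustration of the U_DRAFT idea above, here is a minimal sketch. All names in it (DEMO_HIDE_DRAFT, DEMO_DRAFT, demo_openDraft, demo_client) are hypothetical and are not ICU identifiers from these notes. With the switch off, the macro expands to an ordinary declaration. With it on, the declaration becomes static, so a client that calls the function would refer to an internal-linkage function that has no definition in its translation unit. Whether a compiler reports that as an error or only a warning varies, which is the portability question noted above.

```c
#include <assert.h>

/* Hypothetical per-client switch (an assumption, not an ICU macro):
 * a client defines DEMO_HIDE_DRAFT before including the header. */
#ifdef DEMO_HIDE_DRAFT
#   define DEMO_DRAFT static   /* internal linkage, no definition visible to the client */
#else
#   define DEMO_DRAFT extern   /* stand-in for the regular export declaration */
#endif

/* What the public header would contain. The macro also documents
 * the draft status right at the declaration. */
DEMO_DRAFT int demo_openDraft(int handle);

/* What the library itself would contain (built without DEMO_HIDE_DRAFT). */
#ifndef DEMO_HIDE_DRAFT
int demo_openDraft(int handle) {
    assert(handle >= 0);
    return handle + 1;
}
#endif

/* Client code. */
int demo_client(void) {
#ifdef DEMO_HIDE_DRAFT
    /* Calling demo_openDraft(41) here would be diagnosed
     * (error or warning, depending on the compiler). */
    return -1;
#else
    return demo_openDraft(41);
#endif
}
```

Only the macro definition changes between the two modes, so no #ifdef is needed around each individual function declaration.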
- C++
- Disabling can be accomplished by not publishing the header files. (hdrtest has the list)
- We don't have a good idea how to suppress individual methods (in C++) which are draft.
- Q:[apple] How to produce the list of C++ files?
- we could have a make target which removes the C++ files
- test/hdrtest/cxxfiles.txt contains a list of files, with a # comment. This could work for Apple. Apple would like it if dead C++ files were purged from cxxfiles.txt. Deborah already filed a bug on a better mechanism.
- String Handling - Deborah requested different string handling in our APIs
- Description - Deborah
- Apple String handling
- uses ASCII internally sometimes for efficiency
- linear array is inefficient - Apple uses trees to avoid moving (say) 1g of text around if an editor is doing an insert
- Problems
- some APIs, such as regex, require linear array to work
- some APIs support abstraction such as character iterator and replaceable, but they are char at a time. Concerned about slowness using string search, etc. Concerned that two function calls per character could be a problem
- Request
- Make an abstraction for text access universal, esp. across regex
- Add a chunky character access, where you can get a chunk.
- Critical areas where there must not be a non-inline function call per character.
- ICU Issues [Markus]
- besides the API, internals such as the search engine would have to be rewritten.
- Q: would an API like incremental conversion work? A: No, Regex requires random access.
- Shouldn't be a unique situation to Apple. Would be nice to have regex be able to handle UTF8, say.
- could hack something up for regex
- Could have an iterator with two functions - a char, and a chunk
- would need to do some mapping between internal char and the original char (say for utf8)
- Apple doesn't attempt to make this work for DBCS or MBCS.
- Would it make sense to make an API that works well for Apple on the interface they want and see how well it works using Replaceable?
- Apple: Regex has no comparison (to an existing Unicode regex) currently, but string search is going to be a problem.
- Apple should be able to update the character iterator's API, so that a simple search and replace in the caller is enough to use the chunked API
- Q: We thought that Apple was concerned with C++ APIs, is this now a C++ requirement?
- we already have a wrapper for C regex, which calls C++ - concerned about how that implementation does character access.
- could add function pointers to ucharacteriterator which would do fastnext() and fastprev(), which would by default call the slow per-char function. Can accomplish this by setting the buffer size to 0 so that the per-char function is always called. This would give us the opportunity to migrate the implementations where it makes sense (some services could be migrated to use the new functions).
- fastnext [inline] would just call through to next() currently.
- Frank: is there an issue of needing to access more than 16 bits of a character? Should the API return 32 bit?
- A: no, services detect this situation and fetch the trail surrogate already. UCharIterator is purposely simple.
- @@ Deborah will write up a proposal. Will include the option of client-managed buffering.
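The fastnext() idea discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions: DemoIter, demo_fastNext, demo_fillChunk and the other names are hypothetical and do not reflect the actual UCharIterator layout or Deborah's eventual proposal. An inline fast path reads from a buffer; when the buffer size is 0 and no chunk provider is set, every call falls through to the slow per-char function. Only the forward direction is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoIter DemoIter;
struct DemoIter {
    int32_t (*next)(DemoIter *it);  /* slow path: one call per code unit, -1 at end */
    int32_t (*fill)(DemoIter *it);  /* optional: expose the next chunk, return its size */
    const uint16_t *buf;            /* current chunk */
    int32_t bufLen;                 /* 0 means: always use next() */
    int32_t bufPos;
    const uint16_t *text;           /* demo text source (a flat array here) */
    int32_t index;
    int32_t length;
};

/* Inline fast path: no function call while the chunk has data. */
static inline int32_t demo_fastNext(DemoIter *it) {
    if (it->bufPos < it->bufLen) {
        return it->buf[it->bufPos++];
    }
    if (it->fill != NULL) {
        it->bufLen = it->fill(it);
        it->bufPos = 0;
        return it->bufLen > 0 ? it->buf[it->bufPos++] : -1;
    }
    return it->next(it);            /* default: call through to the slow function */
}

static int32_t demo_slowNext(DemoIter *it) {
    return it->index < it->length ? it->text[it->index++] : -1;
}

#define DEMO_CHUNK 4
static int32_t demo_fillChunk(DemoIter *it) {
    int32_t n = it->length - it->index;
    if (n > DEMO_CHUNK) n = DEMO_CHUNK;
    it->buf = it->text + it->index;
    it->index += n;
    return n;
}

void demo_init(DemoIter *it, const uint16_t *text, int32_t length, int chunked) {
    assert(it != NULL && text != NULL && length >= 0);
    it->next = demo_slowNext;
    it->fill = chunked ? demo_fillChunk : NULL;
    it->buf = NULL;
    it->bufLen = 0;
    it->bufPos = 0;
    it->text = text;
    it->index = 0;
    it->length = length;
}

/* A "service" written once against demo_fastNext works with both kinds of iterator. */
int32_t demo_sum(DemoIter *it) {
    int32_t sum = 0, c;
    while ((c = demo_fastNext(it)) >= 0) sum += c;
    return sum;
}
```

A service migrated to demo_fastNext() behaves the same with an unbuffered iterator, so implementations can be moved over one at a time.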
- Andy: Regex needs two new functions. Regex is all random access due to backtracking.
- utf32 at index,
- and return a buffer of utf16s given an index.
- Also, may be different needs between string search and regex, since string search already uses iterators (collationelementiterator - due to collation complexity.)
- IBM: Scheduling issue for 3.0 on regex. Very slim chance that it would make it. No room for new features, unless we have some help.
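The two random-access functions Andy described for regex (a UTF-32 code point at an index, and a buffer of UTF-16 units given an index) might look roughly like the sketch below. The DemoText type, the fixed-size chunk layout, and the function names are assumptions made for illustration; a real client such as Apple's tree-based storage would differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical non-linear text store: an array of fixed-size chunks. */
typedef struct {
    const uint16_t *const *chunks;  /* each holds chunkLen units; the last may hold fewer */
    int32_t chunkLen;
    int32_t length;                 /* total number of UTF-16 units */
} DemoText;

/* Return the contiguous run of UTF-16 units that contains index.
 * *start receives the index of the run's first unit, *count its size. */
const uint16_t *demo_chunkAt(const DemoText *t, int32_t index,
                             int32_t *start, int32_t *count) {
    assert(t != NULL && start != NULL && count != NULL);
    if (index < 0 || index >= t->length) {
        *start = 0;
        *count = 0;
        return NULL;
    }
    int32_t c = index / t->chunkLen;
    int32_t remaining;
    *start = c * t->chunkLen;
    remaining = t->length - *start;
    *count = remaining < t->chunkLen ? remaining : t->chunkLen;
    return t->chunks[c];
}

static int32_t demo_unitAt(const DemoText *t, int32_t index) {
    int32_t start, count;
    const uint16_t *run = demo_chunkAt(t, index, &start, &count);
    return run != NULL ? run[index - start] : -1;
}

/* Return the code point at index (either half of a surrogate pair may be
 * addressed), or -1 if index is out of range. Pairs may span two chunks. */
int32_t demo_char32At(const DemoText *t, int32_t index) {
    int32_t u = demo_unitAt(t, index);
    if (u >= 0xD800 && u <= 0xDBFF) {           /* lead surrogate: look ahead */
        int32_t trail = demo_unitAt(t, index + 1);
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            return 0x10000 + ((u - 0xD800) << 10) + (trail - 0xDC00);
        }
    } else if (u >= 0xDC00 && u <= 0xDFFF) {    /* trail surrogate: look back */
        int32_t lead = demo_unitAt(t, index - 1);
        if (lead >= 0xD800 && lead <= 0xDBFF) {
            return 0x10000 + ((lead - 0xD800) << 10) + (u - 0xDC00);
        }
    }
    return u;                                   /* BMP unit, unpaired surrogate, or -1 */
}
```

Because both functions take an absolute index, a backtracking matcher can jump to any position without walking the text from the start.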
- Character properties access speed
- Problem [Apple]
- the way properties lookup is done is too slow
- Apple did some testing
- Two forms of lookup
- internal form uses internal property data if already loaded - Bidi, canonical ordering, and internal-element class (modified GC)
- Internal was 2-3x faster than ICU
- Apple uses separate 2-stage lookups per property - 7/9 - a lot of hits for the first 128 char range.
- ICU uses a single trie - people already complain about the size of the property.
- Perhaps less-frequent data could be moved to a separate trie?
- Also, Longer runs of characters with separate tries per property - don't need 2ndary and tertiary tables - 5k-8k per property.
- Markus optimized the trie -
- which is how we have the 11/5 shift.
- also, we have another indirection - a 16 bit index into a 32 bit (?) word. Would it be faster and smaller if the 32bit were split into two chunks?
- @@ ICU - Could have an API that gives access to most of the 32 bit word in one call - general property, mirrored, and others. Deborah to file jitterbug.
- Some of the bits go to other internal tries - those bits could be set to zero.
- Apple's critical: GC, CC, and Bidi category.
- Could get GC and Bidi - CC is in a different trie.
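To illustrate the "most of the 32 bit word in one call" idea, here is a small sketch. The bit layout is invented for the example and is not ICU's real properties word; all DEMO_ and demo_ names are hypothetical. The caller fetches one word per character, with the internal-only bits set to zero, and then extracts each property with inline shifts and masks.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (illustrative only):
 *   bits 0-4   general category
 *   bits 5-9   bidi class
 *   bit  10    mirrored flag
 *   bits 11-31 internal use (indexes into other tries, etc.) */
#define DEMO_GC_MASK      0x1Fu
#define DEMO_BIDI_SHIFT   5
#define DEMO_BIDI_MASK    0x1Fu
#define DEMO_MIRROR_BIT   (1u << 10)
#define DEMO_PUBLIC_MASK  0x7FFu

/* The single "API call": hand out the word with internal bits zeroed. */
uint32_t demo_publicWord(uint32_t raw) {
    return raw & DEMO_PUBLIC_MASK;
}

/* Inline accessors: no further function calls per property. */
static inline int demo_gc(uint32_t w) {
    assert((w & ~DEMO_PUBLIC_MASK) == 0);   /* expects a word from demo_publicWord() */
    return (int)(w & DEMO_GC_MASK);
}

static inline int demo_bidi(uint32_t w) {
    return (int)((w >> DEMO_BIDI_SHIFT) & DEMO_BIDI_MASK);
}

static inline int demo_isMirrored(uint32_t w) {
    return (w & DEMO_MIRROR_BIT) != 0;
}
```

As noted above, a property stored in a different trie (CC in this discussion) could not be served from the same word and would still need its own lookup.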
- Q: Which properties are being used by ICU users in general? A: it's hard enough for ICU to know which companies are using ICU, much less properties!
- Q: is this for rendering? A: yes, the rendering is the most critical, have to get several properties (bidi, etc).
- Could give the engineers a test version which calls ICU for rendering, and have them try it.
- Q: There's a function call that checks if data is already loaded (a static variable). Could find out how much overhead that adds.
- Apple has a lot of internal functions and wants to migrate to ICU for these, as long as (a) features and (b) performance are acceptable for adoption.
- Q: What happens with the Apple 9/7 shift, with supplementary characters? A: supplements are handled as exceptions. Apple has other functions which look up a full UTF32 char, with a ?/?/5 bit shift.
- Suggestion - Apple could cache the lookups from the existing ICU api, into a new data structure.
- Don't have to update the data file, but has fast access.
- ICU has lots of constraints on size. Have to work carefully if we would try to increase the data size.
- Footprint is an issue for Apple also - paging etc.
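The caching suggestion above, combined with a per-property two-stage lookup, could be sketched as follows. This assumes a 16-bit code unit split into a 9-bit block index and a 7-bit offset, which is one reading of the shift mentioned in these notes; supplementary characters are not handled. demo_slowProperty is a made-up stand-in for an existing property API call, and every other name is hypothetical as well.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_SHIFT   7
#define DEMO_BLOCK   (1 << DEMO_SHIFT)         /* 128 values per block */
#define DEMO_STAGE1  (0x10000 >> DEMO_SHIFT)   /* 512 blocks cover the BMP */

typedef uint8_t (*DemoSlowLookup)(uint16_t c);

typedef struct {
    uint16_t stage1[DEMO_STAGE1];   /* block number for each 128-character range */
    uint8_t *stage2;                /* blockCount * DEMO_BLOCK property values */
    int32_t blockCount;
} DemoTable;

/* Stand-in for the existing (slower) API: 1 = ASCII digit, 2 = ASCII letter. */
uint8_t demo_slowProperty(uint16_t c) {
    if (c >= '0' && c <= '9') return 1;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return 2;
    return 0;
}

/* Build the cache once by calling the slow lookup for every BMP code unit.
 * Identical blocks are shared. Returns 1 on success, 0 on allocation failure. */
int demo_build(DemoTable *t, DemoSlowLookup slow) {
    uint8_t block[DEMO_BLOCK];
    int32_t i, j, b;
    assert(t != NULL && slow != NULL);
    t->stage2 = (uint8_t *)malloc((size_t)DEMO_STAGE1 * DEMO_BLOCK);  /* worst case */
    if (t->stage2 == NULL) return 0;
    t->blockCount = 0;
    for (i = 0; i < DEMO_STAGE1; ++i) {
        for (j = 0; j < DEMO_BLOCK; ++j) {
            block[j] = slow((uint16_t)((i << DEMO_SHIFT) | j));
        }
        for (b = 0; b < t->blockCount; ++b) {
            if (memcmp(t->stage2 + (size_t)b * DEMO_BLOCK, block, DEMO_BLOCK) == 0) break;
        }
        if (b == t->blockCount) {   /* new distinct block */
            memcpy(t->stage2 + (size_t)b * DEMO_BLOCK, block, DEMO_BLOCK);
            ++t->blockCount;
        }
        t->stage1[i] = (uint16_t)b;
    }
    return 1;
}

/* Fast path: two array reads, no function call, no "is data loaded" check. */
static inline uint8_t demo_lookup(const DemoTable *t, uint16_t c) {
    return t->stage2[((int32_t)t->stage1[c >> DEMO_SHIFT] << DEMO_SHIFT)
                     | (c & (DEMO_BLOCK - 1))];
}

void demo_free(DemoTable *t) {
    free(t->stage2);
    t->stage2 = NULL;
}
```

The data file stays untouched in this approach; the cost moves to a one-time build and to the extra memory for the cache, which matters given the footprint concerns on both sides.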
- BreakIterator - George
- Introduction
- Also relevant to future structure of resource bundles (trees)
- For 3.0, will split off collation, rbnf, transliterator data
- Q [Apple]: will this affect locality? I.e., will the data for a given locale be separated onto different pages for memory mapping? A: We could sort by locale so that all the pages of, say, 'fr.res' and 'collation/fr.res' are near each other.