Python bindings: Buffer class issue

Hi all,
Quick summary from GFD.90: the SAGA I/O Buffer encapsulates a sequence of bytes to be used for I/O operations, e.g. read()/write() on files and streams, and call() on rpc instances. The recent removal of the buffer class from the Python bindings of the C++ SAGA implementation led us to think again about this issue. The GFD is C/C++ oriented and therefore the Python implementation is anything but clear in this regard.
Given that memory management is automatic in Python, the notion of application-managed and implementation-managed Buffer disappears. There is no need for a Python SAGA user to tell the bindings who manages the Buffer, since it is managed by the underlying Python VM.
Another, more critical issue is the data type used to hold binary data in Python. In Python 2.x the immutable 'str' type is used whereas Python 3.x has a newly introduced immutable 'bytes' type. Let's forget about 3.x for a moment, since 2.x will be around for at least a couple more years. In order to manipulate large binary datasets, the mmap class [0] could be used, which basically transforms an immutable 'str' into a mutable mmap object. In other words it provides the ability to efficiently modify binary data.
In the VU Python bindings the buffer class is still present, while, as previously said, in the C++ Python bindings it was removed recently. I do not see any issues with the removal of the Buffer class in the Python bindings. However, I'm not sure whether I am forgetting some corner cases (e.g. async) that would require a dedicated Buffer class. When removing the Buffer class, the user would simply deal with 'str' type data to pass data back and forth to a SAGA file, stream or rpc.
Now, I identified the following crucial questions:
1) Can the Buffer class be safely removed from the Python bindings?
2) Is handling of large binary datasets a primary concern? If yes, how to handle them?
3) Is compliance with Python 3.x a concern right now? In other words, should the eventual migration to 3.x be taken into consideration?
Cheers, /Manuel
[0] http://docs.python.org/library/mmap.html

Quoting [Manuel Franceschini] (Nov 09 2009):
Hi all,
Quick summary from GFD.90: the SAGA I/O Buffer encapsulates a sequence of bytes to be used for I/O operations, e.g. read()/write() on files and streams, and call() on rpc instances. The recent removal of the buffer class from the Python bindings of the C++ SAGA implementation led us to think again about this issue. The GFD is C/C++ oriented
Well, it should not be C/C++ oriented, but the bias of the authors probably shows :-) The intent was to support binary I/O on any language, as that was mentioned in many use cases.
and therefore the Python implementation is anything but clear in this regard.
Given that memory management is automatic in Python, the notion of application-managed and implementation-managed Buffer disappears.
From what I learned during the discussion in Banff, this is not really true: one *can* allocate an array in user space and pass it to an API by-reference, which actually makes it an application-managed memory segment. The point in Python seems to be that nobody is doing that...
There is no need for a Python SAGA user to tell the bindings who manages the Buffer, since it is managed by the underlying Python VM.
Another, more critical issue is the data type used to hold binary data in Python. In Python 2.x the immutable 'str' type is used whereas Python 3.x has a newly introduced immutable 'bytes' type. Let's forget about 3.x for a moment, since 2.x will be around for at least a couple more years. In order to manipulate large binary datasets, the mmap class [0] could be used, which basically transforms an immutable 'str' into a mutable mmap object. In other words it provides the ability to efficiently modify binary data.
In the VU Python bindings the buffer class is still present, while, as previously said, in the C++ Python bindings it was removed recently. I do not see any issues with the removal of the Buffer class in the Python bindings. However, I'm not sure whether I am forgetting some corner cases (e.g. async) that would require a dedicated Buffer class. When removing the Buffer class, the user would simply deal with 'str' type data to pass data back and forth to a SAGA file, stream or rpc.
If the bindings decide to go for strings, then that should pose no problem for the async calls, as far as I can tell: semantics of sync and async calls is identical (apart from synchronization obviously).
Now, I identified the following crucial questions: 1) Can the Buffer class be safely removed from the Python bindings?
According to the original SAGA use cases: no
According to current SAGA users: yes
So, tough call ;-)
2) Is handling of large binary datasets a primary concern? If yes, how to handle them?
See above. How to handle: dunno - that is the question, innit?
3) Is compliance with Python 3.x a concern right now? In other words, should the eventual migration to 3.x be taken into consideration?
If 3.x makes something easier, it might be good to be aware of it at least. I think all agree that 2.x will be around for a long time, and that limiting the bindings to 3.x is not an option. OTOH, it should be possible to have slightly differing bindings for 2.x and 3.x, depending on the changes in the language itself.
One proposal which came up a couple of times, and which I find appealing, is to have support for strings (simple, solves many use cases, pythonesque), and to add binary buffers for python-3.x (natively supported, covers the remaining use cases, stays close to spec). Personally, I don't see the need for jumping through hoops for python-2.x.
Cheers, Andre.
Cheers, /Manuel
[0] http://docs.python.org/library/mmap.html
-- Nothing is ever easy.

On Mon, Nov 9, 2009 at 11:10 PM, Andre Merzky <andre@merzky.net> wrote:
Quoting [Manuel Franceschini] (Nov 09 2009):
Hi all,
Quick summary from GFD.90: the SAGA I/O Buffer encapsulates a sequence of bytes to be used for I/O operations, e.g. read()/write() on files and streams, and call() on rpc instances. The recent removal of the buffer class from the Python bindings of the C++ SAGA implementation led us to think again about this issue. The GFD is C/C++ oriented
Well, it should not be C/C++ oriented, but the bias of the authors probably shows :-) The intent was to support binary I/O on any language, as that was mentioned in many use cases.
and therefore the Python implementation is anything but clear in this regard.
Given that memory management is automatic in Python, the notion of application-managed and implementation-managed Buffer disappears.
From what I learned during the discussion in Banff, this is not really true: one *can* allocate an array in user space and pass it to an API by-reference, which actually makes it an application-managed memory segment. The point in Python seems to be that nobody is doing that...
Well, in Python there is *only* by-reference parameter passing, references to objects that is. Version 2.6 introduced an io module that allows you to do what you describe. One problem with this is that our JySAGA bindings can't support this new feature as Jython just reached version 2.5.1 and it looks like there is quite a long way to go to 2.6.
I did some memory profiling with large chunks of data copied from one file to another and the automatic memory management in Python seemed to be very efficient. In my tests the garbage collection was instantaneous. In other words, as soon as there were no more references to a data chunk, memory was deallocated. So when shuffling 1 MB chunks 10000 times from one file to another, the memory consumption of the test program never exceeded 2.5 MB. If somebody can come up with a test program that shows the advantage of using the new io module in relevant use cases, we could think about using it in the C++ bindings. Otherwise, why optimize when there's no real problem?
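The measurement described above can be approximated with a short script. This is only a sketch under assumptions: it uses ordinary Python file objects instead of SAGA files, relies on Python 3's tracemalloc module (not available in 2.x) to observe peak allocation, and copies far fewer chunks than the 10000 of the original test. The names copy_in_chunks and measure_copy are made up for illustration.

```python
import os
import tempfile
import tracemalloc

CHUNK_SIZE = 1024 * 1024  # 1 MB per read, as in the test described above


def copy_in_chunks(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Copy src_path to dst_path chunk by chunk; return bytes copied."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)  # new immutable bytes object per read
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
            # Rebinding 'chunk' on the next pass drops the last reference to
            # the previous object, so CPython's reference counting frees it.
    return copied


def measure_copy(num_chunks=16, chunk_size=CHUNK_SIZE):
    """Return (bytes_copied, peak_traced_bytes) for one chunked copy."""
    with tempfile.TemporaryDirectory() as tmpdir:
        src_path = os.path.join(tmpdir, "src.bin")
        dst_path = os.path.join(tmpdir, "dst.bin")
        # Create the source file before tracing starts.
        with open(src_path, "wb") as f:
            for _ in range(num_chunks):
                f.write(b"\0" * chunk_size)
        tracemalloc.start()
        copied = copy_in_chunks(src_path, dst_path, chunk_size)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return copied, peak


if __name__ == "__main__":
    copied, peak = measure_copy()
    print("copied %d bytes, peak traced allocation %d bytes" % (copied, peak))
```

On CPython one would expect the peak to stay within a few chunk sizes no matter how many chunks are copied, which is in line with the figure reported above. Interpreters with a different garbage collector, such as Jython, may behave differently.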
There is no need for a Python SAGA user to tell the bindings who manages the Buffer, since it is managed by the underlying Python VM.
Another, more critical issue is the data type used to hold binary data in Python. In Python 2.x the immutable 'str' type is used whereas Python 3.x has a newly introduced immutable 'bytes' type. Let's forget about 3.x for a moment, since 2.x will be around for at least a couple more years. In order to manipulate large binary datasets, the mmap class [0] could be used, which basically transforms an immutable 'str' into a mutable mmap object. In other words it provides the ability to efficiently modify binary data.
In the VU Python bindings the buffer class is still present, while, as previously said, in the C++ Python bindings it was removed recently. I do not see any issues with the removal of the Buffer class in the Python bindings. However, I'm not sure whether I am forgetting some corner cases (e.g. async) that would require a dedicated Buffer class. When removing the Buffer class, the user would simply deal with 'str' type data to pass data back and forth to a SAGA file, stream or rpc.
If the bindings decide to go for strings, then that should pose no problem for the async calls, as far as I can tell: semantics of sync and async calls is identical (apart from synchronization obviously).
Now, I identified the following crucial questions: 1) Can the Buffer class be safely removed from the Python bindings?
According to the original SAGA use cases: no
According to current SAGA users: yes
So, tough call ;-)
What do other people think?
2) Is handling of large binary datasets a primary concern? If yes, how to handle them?
See above. How to handle: dunno - that is the question, innit?
The mmap module can be used for modifying binary data in place.
3) Is compliance with Python 3.x a concern right now? In other words, should the eventual migration to 3.x be taken into consideration?
If 3.x makes something easier, it might be good to be aware of it at least. I think all agree that 2.x will be around for a long time, and that limiting the bindings to 3.x is not an option. OTOH, it should be possible to have slightly differing bindings for 2.x and 3.x, depending on the changes in the language itself.
Yeah, I don't think we should think too much about that now. But for the future it will bring several benefits to the Python bindings.
Cheers, /Manuel

Quoting [Manuel Franceschini] (Nov 11 2009):
On Mon, Nov 9, 2009 at 11:10 PM, Andre Merzky <andre@merzky.net> wrote:
Quoting [Manuel Franceschini] (Nov 09 2009):
Hi all,
Quick summary from GFD.90: the SAGA I/O Buffer encapsulates a sequence of bytes to be used for I/O operations, e.g. read()/write() on files and streams, and call() on rpc instances. The recent removal of the buffer class from the Python bindings of the C++ SAGA implementation led us to think again about this issue. The GFD is C/C++ oriented
Well, it should not be C/C++ oriented, but the bias of the authors probably shows :-) The intent was to support binary I/O on any language, as that was mentioned in many use cases.
and therefore the Python implementation is anything but clear in this regard.
Given that memory management is automatic in Python, the notion of application-managed and implementation-managed Buffer disappears.
From what I learned during the discussion in Banff, this is not really true: one *can* allocate an array in user space and pass it to an API by-reference, which actually makes it an application-managed memory segment. The point in Python seems to be that nobody is doing that...
Well, in Python there is *only* by-reference parameter passing, references to objects that is. Version 2.6 introduced an io module that allows you to do what you describe. One problem with this is that our JySAGA bindings can't support this new feature as Jython just reached version 2.5.1 and it looks like there is quite a long way to go to 2.6.
That is an implementation problem, and should not influence the python bindings, right? ;-)
I did some memory profiling with large chunks of data copied from one file to another and the automatic memory management in Python seemed to be very efficient. In my tests the garbage collection was instantaneous. In other words, as soon as there were no more references to a data chunk, memory was deallocated. So when shuffling 1 MB chunks 10000 times from one file to another, the memory consumption of the test program never exceeded 2.5 MB. If somebody can come up with a test program that shows the advantage of using the new io module in relevant use cases, we could think about using it in the C++ bindings. Otherwise, why optimize when there's no real problem?
Fair point. But, BTW, I don't see app-managed buffers as a way to optimize memory consumption, but as a way to optimize latency, as you save memcpy calls. In theory at least...
There is no need for a Python SAGA user to tell the bindings who manages the Buffer, since it is managed by the underlying Python VM.
Another, more critical issue is the data type used to hold binary data in Python. In Python 2.x the immutable 'str' type is used whereas Python 3.x has a newly introduced immutable 'bytes' type. Let's forget about 3.x for a moment, since 2.x will be around for at least a couple more years. In order to manipulate large binary datasets, the mmap class [0] could be used, which basically transforms an immutable 'str' into a mutable mmap object. In other words it provides the ability to efficiently modify binary data.
In the VU Python bindings the buffer class is still present, while, as previously said, in the C++ Python bindings it was removed recently. I do not see any issues with the removal of the Buffer class in the Python bindings. However, I'm not sure whether I am forgetting some corner cases (e.g. async) that would require a dedicated Buffer class. When removing the Buffer class, the user would simply deal with 'str' type data to pass data back and forth to a SAGA file, stream or rpc.
If the bindings decide to go for strings, then that should pose no problem for the async calls, as far as I can tell: semantics of sync and async calls is identical (apart from synchronization obviously).
Now, I identified the following crucial questions: 1) Can the Buffer class be safely removed from the Python bindings?
According to the original SAGA use cases: no
According to current SAGA users: yes
So, tough call ;-)
What do other people think?
anybody??
3) Is compliance with Python 3.x a concern right now? In other words, should the eventual migration to 3.x be taken into consideration?
If 3.x makes something easier, it might be good to be aware of it at least. I think all agree that 2.x will be around for a long time, and that limiting the bindings to 3.x is not an option. OTOH, it should be possible to have slightly differing bindings for 2.x and 3.x, depending on the changes in the language itself.
Yeah, I don't think we should think too much about that now. But for the future it will bring several benefits to the Python bindings.
agree.
Cheers, Andre.
-- Nothing is ever easy.

On Wed, 2009-11-11 at 21:11 +0100, Andre Merzky wrote:
Quoting [Manuel Franceschini] (Nov 11 2009):
On Mon, Nov 9, 2009 at 11:10 PM, Andre Merzky <andre@merzky.net> wrote:
Quoting [Manuel Franceschini] (Nov 09 2009):
Hi all,
Quick summary from GFD.90: the SAGA I/O Buffer encapsulates a sequence of bytes to be used for I/O operations, e.g. read()/write() on files and streams, and call() on rpc instances. The recent removal of the buffer class from the Python bindings of the C++ SAGA implementation led us to think again about this issue. The GFD is C/C++ oriented
Well, it should not be C/C++ oriented, but the bias of the authors probably shows :-) The intent was to support binary I/O on any language, as that was mentioned in many use cases.
and therefore the Python implementation is anything but clear in this regard.
Given that memory management is automatic in Python, the notion of application-managed and implementation-managed Buffer disappears.
From what I learned during the discussion in Banff, this is not really true: one *can* allocate an array in user space and pass it to an API by-reference, which actually makes it an application-managed memory segment. The point in Python seems to be that nobody is doing that...
Well, in Python there is *only* by-reference parameter passing, references to objects that is. Version 2.6 introduced an io module that allows you to do what you describe. One problem with this is that our JySAGA bindings can't support this new feature as Jython just reached version 2.5.1 and it looks like there is quite a long way to go to 2.6.
That is an implementation problem, and should not influence the python bindings, right? ;-)
Well, defining bindings that break all current implementations and their usage won't work either. The C++ wrapper now also requires Python >= 2.2. Would all current users be willing/able to upgrade to >= 2.6?
The bindings will have to define which Python version is required. It is not only a matter of 2.x or 3.x; the 2.x versions also contain increasingly more relevant functionality.
We can either opt for something low (e.g. >= 2.2) to increase acceptance, or something high (e.g. >= 2.6 or >= 3.0) if these contain features that are essential for the bindings. A third option is to specify optional additional functionality for an implementation that's only targeted at newer versions of Python, but that will probably generate a lot of confusion.
I'd say we stick to >= 2.2; widely used, and supported by all current implementations.
I did some memory profiling with large chunks of data copied from one file to another and the automatic memory management in Python seemed to be very efficient. In my tests the garbage collection was instantaneous. In other words, as soon as there were no more references to a data chunk, memory was deallocated. So when shuffling 1 MB chunks 10000 times from one file to another, the memory consumption of the test program never exceeded 2.5 MB. If somebody can come up with a test program that shows the advantage of using the new io module in relevant use cases, we could think about using it in the C++ bindings. Otherwise, why optimize when there's no real problem?
Fair point.
But, BTW, I don't see app-managed buffers as a way to optimize memory consumption, but as a way to optimize latency, as you save memcpy calls. In theory at least...
There is no need for a Python SAGA user to tell the bindings who manages the Buffer, since it is managed by the underlying Python VM.
Another, more critical issue is the data type used to hold binary data in Python. In Python 2.x the immutable 'str' type is used whereas Python 3.x has a newly introduced immutable 'bytes' type. Let's forget about 3.x for a moment, since 2.x will be around for at least a couple more years. In order to manipulate large binary datasets, the mmap class [0] could be used, which basically transforms an immutable 'str' into a mutable mmap object. In other words it provides the ability to efficiently modify binary data.
Not really; it memory-maps a file, not an arbitrary string. However, you can easily convert a string to a list or array and manipulate that in place. The real question is: which use cases are we trying to optimize? What will SAGA Python apps do with binary data?
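A small sketch of that distinction: mmap operates on a file, while in-memory binary data can be copied once into a mutable container and edited there. It assumes Python 2.6 or later for bytearray (an array of bytes would play the same role on older versions) and is written for Python 3; the function names are illustrative only.

```python
import mmap
import os
import tempfile


def patch_in_memory(data, offset, replacement):
    """Return a modified copy of immutable binary data."""
    buf = bytearray(data)  # one copy into a mutable buffer
    buf[offset:offset + len(replacement)] = replacement  # edited in place
    return bytes(buf)


def patch_file_in_place(path, offset, replacement):
    """Overwrite bytes inside an existing, non-empty file via a memory map."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:  # length 0 maps the whole file
            # The slice must stay inside the file: a map cannot grow it.
            mm[offset:offset + len(replacement)] = replacement
            mm.flush()


if __name__ == "__main__":
    print(patch_in_memory(b"hello world", 6, b"WORLD"))
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "data.bin")
        with open(path, "wb") as f:
            f.write(b"hello world")
        patch_file_in_place(path, 0, b"HELLO")
        with open(path, "rb") as f:
            print(f.read())
```

The first function never touches a file, the second never holds the whole payload as a Python string, which is why the two approaches answer different use cases.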
In the VU Python bindings the buffer class is still present, while, as previously said, in the C++ Python bindings it was removed recently. I do not see any issues with the removal of the Buffer class in the Python bindings. However, I'm not sure whether I am forgetting some corner cases (e.g. async) that would require a dedicated Buffer class. When removing the Buffer class, the user would simply deal with 'str' type data to pass data back and forth to a SAGA file, stream or rpc.
If the bindings decide to go for strings, then that should pose no problem for the async calls, as far as I can tell: semantics of sync and async calls is identical (apart from synchronization obviously).
Now, I identified the following crucial questions: 1) Can the Buffer class be safely removed from the Python bindings?
According to the original SAGA use cases: no
According to current SAGA users: yes
What were the original use cases that required a Buffer class?
So, tough call ;-)
What do other people think?
anybody??
3) Is compliance with Python 3.x a concern right now? In other words, should the eventual migration to 3.x be taken into consideration?
If 3.x makes something easier, it might be good to be aware of it at least. I think all agree that 2.x will be around for a long time, and that limiting the bindings to 3.x is not an option. OTOH, it should be possible to have slightly differing bindings for 2.x and 3.x, depending on the changes in the language itself.
Yeah, I don't think we should think too much about that now. But for the future it will bring several benefits to the Python bindings.
agree.
Cheers, Andre.
-Mathijs

Quoting [Mathijs den Burger] (Nov 12 2009):
Given that memory management is automatic in Python, the notion of application-managed and implementation-managed Buffer disappears.
From what I learned during the discussion in Banff, this is not really true: one *can* allocate an array in user space and pass it to an API by-reference, which actually makes it an application-managed memory segment. The point in Python seems to be that nobody is doing that...
Well, in Python there is *only* by-reference parameter passing, references to objects that is. Version 2.6 introduced an io module that allows you to do what you describe. One problem with this is that our JySAGA bindings can't support this new feature as Jython just reached version 2.5.1 and it looks like there is quite a long way to go to 2.6.
That is an implementation problem, and should not influence the python bindings, right? ;-)
Well, defining bindings that break all current implementations and their usage won't work either. The C++ wrapper now also requires Python >= 2.2. Would all current users be willing/able to upgrade to >= 2.6?
*sigh* Tough one. Obviously, I'd love to define the bindings more from an (assumed) user perspective, but sure, if there is no possible implementation, that does not make too much sense...
The bindings will have to define which Python version is required. It is not only a matter of 2.x or 3.x; the 2.x versions also contain increasingly more relevant functionality.
That may make sense, but of course it is confusing to have too many levels here. I guess that 2.x versus 3.x is a natural watershed, whereas 2.x versus 2.y may look somewhat arbitrary to the user.
We can either opt for something low (e.g. >= 2.2) to increase acceptance, or something high (e.g. >= 2.6 or >= 3.0) if these contain features that are essential for the bindings. A third option is to specify optional additional functionality for an implementation that's only targeted at newer versions of Python, but that will probably generate a lot of confusion.
I'd say we stick to >= 2.2; widely used, and supported by all current implementations.
Agree, but I would also suggest having an updated version for 3.x, as it seems to have fundamental changes. But, once more, I am not a Python expert, so my opinion should not be weighted too highly...
Not really; it memory-maps a file, not an arbitrary string. However, you can easily convert a string to a list or array and manipulate that in place.
The real question is: which use cases are we trying to optimize? What will SAGA Python apps do with binary data?
The primary use cases asking for optimized I/O on streams and files are visualization use cases (you should know *those* use cases, right? ;-) So, an application is repeatedly reading a couple of megabytes or so from a stream or a file, into a memory buffer, at say 2*30fps, to run some algorithm on the data (say an isosurfacer).
Now, in the simple case your kernel allocates memory for the read you request, and reads data into that memory. Later, that data is copied into your application memory so your app can access the data.
The buffer object allows the application to allocate the memory, and to pass the buffer down to the kernel, so that it can read data directly into the buffer (zero copy: no need to copy it again).
I have no idea how prevalent that use case and similar ones are for the Python community we are targeting, so it's very hard to argue for or against that optimization. Zero copy is in general useful in other use cases, too, but again, it is hard to guesstimate the gain without any specific use case to discuss.
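For comparison, this is roughly what the application-managed pattern looks like in Python itself. It is a sketch under assumptions: a built-in file object stands in for a SAGA file or stream, the summing step is a placeholder for a real per-frame algorithm, and checksum_stream is a made-up name. It uses readinto from the io layer (Python 2.6 and later) and is written for Python 3.

```python
def checksum_stream(path, frame_size=4096):
    """Read a file frame by frame into one reusable buffer.

    Returns (total_bytes_read, sum_of_all_byte_values).
    """
    buf = bytearray(frame_size)  # application-managed buffer, allocated once
    view = memoryview(buf)       # allows slicing the buffer without copying
    total = 0
    checksum = 0
    # buffering=0 yields a raw file object, so readinto() fills 'buf'
    # directly instead of going through Python's intermediate buffer.
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)  # overwrites buf in place, returns bytes read
            if not n:
                break
            total += n
            checksum += sum(view[:n])  # placeholder for the real algorithm
    return total, checksum
```

Compared with calling read() in a loop, no new string object is created per frame. Whether that saves a measurable amount of time is exactly the open question above, and would have to be shown with a concrete use case.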
In the VU Python bindings the buffer class is still present, while, as previously said, in the C++ Python bindings it was removed recently. I do not see any issues with the removal of the Buffer class in the Python bindings. However, I'm not sure whether I am forgetting some corner cases (e.g. async) that would require a dedicated Buffer class. When removing the Buffer class, the user would simply deal with 'str' type data to pass data back and forth to a SAGA file, stream or rpc.
If the bindings decide to go for strings, then that should pose no problem for the async calls, as far as I can tell: semantics of sync and async calls is identical (apart from synchronization obviously).
Now, I identified the following crucial questions: 1) Can the Buffer class be safely removed from the Python bindings?
According to the original SAGA use cases: no
According to current SAGA users: yes
What were the original use cases that required a Buffer class?
See above. You may also want to have a look at the SAGA use case doc and the SAGA req doc
http://www.ogf.org/documents/GFD.70.pdf
http://www.ogf.org/documents/GFD.71.pdf
Cheers, Andre.
-- Nothing is ever easy.

On Sat, 2009-11-14 at 12:04 +0100, Andre Merzky wrote:
The real question is: which use cases are we trying to optimize? What will SAGA Python apps do with binary data?
The primary use cases asking for optimized I/O on streams and files are visualization use cases (you should know *those* use cases, right? ;-) So, an application is repeatedly reading a couple of megabytes or so from a stream or a file, into a memory buffer, at say 2*30fps, to run some algorithm on the data (say an isosurfacer).
Now, in the simple case your kernel allocates memory for the read you request, and reads data into that memory. Later, that data is copied into your application memory so your app can access the data.
The buffer object allows the application to allocate the memory, and to pass the buffer down to the kernel, so that it can read data directly into the buffer (zero copy: no need to copy it again).
I have no idea how prevalent that use case and similar ones are for the Python community we are targeting, so it's very hard to argue for or against that optimization. Zero copy is in general useful in other use cases, too, but again, it is hard to guesstimate the gain without any specific use case to discuss.
OK, time to wrap up. I propose to NOT include a 'Buffer' class in the SAGA bindings for Python 2.x (i.e. >= 2.2). Instead of buffers we'll use strings. Reasons:
1. it is hard to express in Python 2.x (no native 'bytes' type, no 'readinto' with a buffer to a pointer, etc.)
2. current users of the C++ Python wrapper do not seem to miss it
3. the number of use cases for buffers is fairly low (GFD.70 and GFD.71 both contain the word 'buffer' only once!)
4. it requires fewer changes to the C++ Python wrapper code
A dedicated Buffer class MAY be added to a new version of the bindings for Python 3.x. Reasons:
1. Python 3.x does have a native 'bytes' type and methods like 'readinto'
2. New users may desire it
Sounds reasonable?
best, Mathijs
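To illustrate the shape of this proposal, here is a hedged sketch of the two interface styles side by side. None of these class or method names come from the SAGA specification or from any existing bindings; StringStyleFile, BufferStyleFile and read_into are hypothetical, and an in-memory io.BytesIO stands in for a real file, stream or rpc backend.

```python
import io


class StringStyleFile(object):
    """Hypothetical file wrapper for the string-based (Python 2.x) proposal."""

    def __init__(self, backend):
        # 'backend' is any binary file-like object; a stand-in for whatever
        # the real bindings would wrap.
        self._backend = backend

    def read(self, size=-1):
        """Return up to 'size' bytes as an immutable string of bytes."""
        return self._backend.read(size)

    def write(self, data):
        """Write a string of bytes; return the number of bytes written."""
        return self._backend.write(data)


class BufferStyleFile(StringStyleFile):
    """Hypothetical Python 3.x variant that additionally accepts a buffer."""

    def read_into(self, buffer):
        """Fill a caller-supplied mutable buffer; return bytes read."""
        return self._backend.readinto(buffer)


if __name__ == "__main__":
    f = BufferStyleFile(io.BytesIO(b"binary payload"))
    print(f.read(6))              # string-style read
    buf = bytearray(8)
    print(f.read_into(buf), buf)  # buffer-style read into caller's memory
```

In this layout the buffer-based call is purely additive, so code written against the string-based interface would keep working against the extended one.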

A dedicated Buffer class MAY be added to a new version of the bindings for Python 3.x. Reasons:
MAY -> SHOULD?
Otherwise: +1
A.
Quoting [Mathijs den Burger] (Nov 18 2009):
Subject: Re: [SAGA-RG] Python bindings: Buffer class issue
From: Mathijs den Burger <mathijs@cs.vu.nl>
To: Andre Merzky <andre@merzky.net>
Cc: Manuel Franceschini <livewire@koltern.com>, SAGA-RG <saga-rg@ogf.org>
Date: Wed, 18 Nov 2009 14:04:56 +0100
On Sat, 2009-11-14 at 12:04 +0100, Andre Merzky wrote:
The real question is: which use cases are we trying to optimize? What will SAGA Python apps do with binary data?
The primary use cases asking for optimized I/O on streams and files are visualization use cases (you should know *those* use cases, right? ;-) So, an application is repeatedly reading a couple of megabytes or so from a stream or a file, into a memory buffer, at say 2*30fps, to run some algorithm on the data (say an isosurfacer).
Now, in the simple case your kernel allocates memory for the read you request, and reads data into that memory. Later, that data is copied into your application memory so your app can access the data.
The buffer object allows the application to allocate the memory, and to pass the buffer down to the kernel, so that it can read data directly into the buffer (zero copy: no need to copy it again).
I have no idea how prevalent that use case and similar ones are for the Python community we are targeting, so it's very hard to argue for or against that optimization. Zero copy is in general useful in other use cases, too, but again, it is hard to guesstimate the gain without any specific use case to discuss.
OK, time to wrap up. I propose to NOT include a 'Buffer' class in the SAGA bindings for Python 2.x (i.e. >= 2.2). Instead of buffers we'll use strings. Reasons:
1. it is hard to express in Python 2.x (no native 'bytes' type, no 'readinto' with a buffer to a pointer, etc.)
2. current users of the C++ Python wrapper do not seem to miss it
3. the number of use cases for buffers is fairly low (GFD.70 and GFD.71 both contain the word 'buffer' only once!)
4. it requires fewer changes to the C++ Python wrapper code
A dedicated Buffer class MAY be added to a new version of the bindings for Python 3.x. Reasons:
1. Python 3.x does have a native 'bytes' type and methods like 'readinto'
2. New users may desire it
Sounds reasonable?
best, Mathijs
-- Nothing is ever easy.
participants (3): Andre Merzky, Manuel Franceschini, Mathijs den Burger