> +/* > + * Allocate an object of a certain size > + * > + * Returns a per cpu pointer that must not be directly used. > + */ > +static void *cpu_alloc(unsigned long size) > +{
We might need to give an alignment constraint here. Some per_cpu users would like to get a 64-byte zone, sitting in one cache line and not two :)
Yes good idea, but a little typo here :) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Eric Dumazet <da...@cosmosbay.com> Date: Thu, 01 Nov 2007 08:17:58 +0100
> Say we reserve on x86_64 a really huge (2^32 bytes) area, and change > VM layout so that each cpu maps its own per_cpu area on this area, > so that the local per_cpu data sits in the same virtual address on > each cpu.
This is a mechanism used partially on IA64 already.
I think you have to be very careful, and you can only use this per-cpu fixed virtual address area in extremely limited cases.
The reason is, I think the address matters, consider list heads, for example.
So you couldn't do:
list_add(&obj->list, &per_cpu_ptr(list_head));
and use that per-cpu fixed virtual address.
IA64 seems to use it universally for every __get_cpu_var() access, so maybe it works out somehow :-)))
I guess if list modifications by remote cpus are disallowed, it would work (list traversal works because using the fixed virtual address as the list head sentinel is OK), but that is an extremely fragile assumption to base the entire mechanism upon.
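The back-pointer problem described above can be illustrated with a small user-space model. This is only a sketch, not kernel code: addresses are toy integers, `FIXED_HEAD` stands in for the fixed virtual address that every cpu maps to its own area, and `resolve()`, `toy_list_add()` and `toy_list_del()` are hypothetical helpers invented for the illustration.

```c
#include <assert.h>

#define NCPUS      2
#define FIXED_HEAD 1000   /* toy fixed virtual address: "my own list head" */
#define NODE_ADDR  3000   /* toy address of the single list node */

struct link { int next, prev; };  /* links hold toy addresses, not C pointers */

static struct link heads[NCPUS];  /* one list head per cpu */
static struct link node;          /* one shared list entry */

/* Translate a toy address as seen by `cpu`. FIXED_HEAD is the alias:
 * it resolves to a different head depending on which cpu dereferences it. */
static struct link *resolve(int cpu, int addr)
{
    assert(cpu >= 0 && cpu < NCPUS);
    if (addr == FIXED_HEAD)
        return &heads[cpu];
    assert(addr == NODE_ADDR);
    return &node;
}

/* Every head starts as an empty list whose sentinel is the fixed address. */
static void init_heads(void)
{
    for (int c = 0; c < NCPUS; c++)
        heads[c].next = heads[c].prev = FIXED_HEAD;
}

/* Like list_add(): the head's address is stored in the entry's back pointer. */
static void toy_list_add(int cpu, int entry_addr, int head_addr)
{
    struct link *head = resolve(cpu, head_addr);
    struct link *entry = resolve(cpu, entry_addr);
    int first = head->next;

    entry->next = first;
    entry->prev = head_addr;
    resolve(cpu, first)->prev = entry_addr;
    head->next = entry_addr;
}

/* Like list_del(): follows the stored pointers from `cpu`'s point of view. */
static void toy_list_del(int cpu, int entry_addr)
{
    struct link *e = resolve(cpu, entry_addr);

    resolve(cpu, e->prev)->next = e->next;
    resolve(cpu, e->next)->prev = e->prev;
}
```

In this model, if cpu 0 adds the node through `FIXED_HEAD` and cpu 1 later deletes it, the delete follows the stored back pointer to cpu 1's own head, so cpu 0's head is left pointing at a removed entry. A delete performed on cpu 0 unlinks it correctly, which is why restricting modifications to the local cpu makes the scheme work.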
> This hunk helped the sparc64 looping OOPS I was getting, but cpus hang > in some other fashion soon afterwards.
And if I bump PER_CPU_ALLOC_SIZE up to 128K it seems to mostly work.
You'll definitely need to make this work dynamically somehow.
On Thu, 1 Nov 2007, Eric Dumazet wrote: > I think this question already came in the past and Linus already answered it, > but I again ask it. What about VM games with modern cpus (64 bits arches)
> Say we reserve on x86_64 a really huge (2^32 bytes) area, and change VM layout > so that each cpu maps its own per_cpu area on this area, so that the local > per_cpu data sits in the same virtual address on each cpu. Then we dont need a > segment prefix nor adding a 'per_cpu offset'. No need to write special asm > functions to read/write/increment a per_cpu data and gcc could use normal > rules for optimizations.
> We only would need adding "per_cpu offset" to get data for a given cpu.
That is basically what IA64 is doing, but it is not usable because you would have addresses that mean different things on different cpus. List heads, for example, require back pointers. If you put a list head into such a per cpu area then you may corrupt another cpu's per cpu area.
On Thu, 1 Nov 2007, Eric Dumazet wrote: > Christoph Lameter a écrit : > > + > > +enum unit_type { FREE, END, USED }; > > + > > +static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
> You mean END here instead of 1 :)
Sigh. A leftover. This can be removed.
> > +/* > > + * Allocate an object of a certain size > > + * > > + * Returns a per cpu pointer that must not be directly used. > > + */ > > +static void *cpu_alloc(unsigned long size) > > +{
> We might need to give an alignment constraint here. Some per_cpu users would > like to get a 64-byte zone, sitting in one cache line and not two :)
Well not sure about that. Alignment is mostly useful on SMP with cacheline contention. This is a per cpu area that should not be contended.
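For reference, here is how an alignment argument could be added to a unit-map allocator of the kind the quoted hunk suggests. This is a user-space sketch under stated assumptions, not the patch's code: `UNIT_SIZE`, the value of `UNITS_PER_CPU`, the meaning given to `USED`/`END`, and the function `toy_cpu_alloc()` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define UNIT_SIZE     8    /* assumed bytes per allocation unit (power of two) */
#define UNITS_PER_CPU 64   /* assumed size of the per cpu area, in units */

/* Assumed meaning: USED marks a unit inside an object, END its last unit. */
enum unit_type { FREE, END, USED };

static unsigned char cpu_alloc_map[UNITS_PER_CPU];  /* zeroed, so all FREE */

/* Return the byte offset of a free run of `size` bytes whose start is a
 * multiple of `align` bytes, or -1 if no such run exists. */
static long toy_cpu_alloc(size_t size, size_t align)
{
    size_t units, step;

    assert(size > 0 && align > 0);
    assert((align & (align - 1)) == 0);   /* power-of-two alignment only */

    units = (size + UNIT_SIZE - 1) / UNIT_SIZE;
    step = (align + UNIT_SIZE - 1) / UNIT_SIZE;  /* candidate starts, in units */

    for (size_t start = 0; start + units <= UNITS_PER_CPU; start += step) {
        size_t i;

        /* Check that every unit of the candidate run is free. */
        for (i = 0; i < units && cpu_alloc_map[start + i] == FREE; i++)
            ;
        if (i < units)
            continue;

        /* Claim the run: USED for all but the last unit, END for the last. */
        for (i = 0; i + 1 < units; i++)
            cpu_alloc_map[start + i] = USED;
        cpu_alloc_map[start + units - 1] = END;
        return (long)(start * UNIT_SIZE);
    }
    return -1;
}
```

With these assumptions, `toy_cpu_alloc(64, 64)` only considers offsets that are multiples of 64, so the object cannot straddle two 64-byte cache lines, provided the per cpu area itself starts on a cache line boundary.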
On Thu, 1 Nov 2007, David Miller wrote: > > This hunk helped the sparc64 looping OOPS I was getting, but cpus hang > > in some other fashion soon afterwards.
> And if I bump PER_CPU_ALLOC_SIZE up to 128K it seems to mostly work.
Good....
> You'll definitely need to make this work dynamically somehow.
Obviously. Any ideas how?
I can probably calculate the size based on the number of online nodes when the per cpu areas are set up. But the setup is done before we even parse command line arguments. That would still mean a fixed size after bootup.
In order to make it truly dynamic we would have to virtually map the area. vmap? But that reduces performance.
Oh I see, it's the offset itself which is accessed at the fixed virtual address slot.
From: Christoph Lameter <clame...@sgi.com> Date: Thu, 1 Nov 2007 05:57:12 -0700 (PDT)
> That is basically what IA64 is doing but it is not usable because you would > have addresses that mean different things on different cpus. List heads, > for example, require back pointers. If you put a list head into such a per > cpu area then you may corrupt another cpu's per cpu area.
Indeed, but as I pointed out in another mail it actually works if you set some rules:
1) List insert and delete is only allowed on local CPU lists.
2) List traversal is allowed on remote CPU lists.
I bet we could get all of the per-cpu users to abide by this rule if we wanted to.
The remaining issue with accessing per-cpu areas at multiple virtual addresses is D-cache aliasing.
From: Christoph Lameter <clame...@sgi.com> Date: Thu, 1 Nov 2007 06:03:44 -0700 (PDT)
> In order to make it truly dynamic we would have to virtually map the > area. vmap? But that reduces performance.
But it would still be faster than the double-indirection we do now, right?
On Thu, 1 Nov 2007, David Miller wrote: > From: Christoph Lameter <clame...@sgi.com> > Date: Thu, 1 Nov 2007 15:11:41 -0700 (PDT)
> > On Thu, 1 Nov 2007, David Miller wrote:
> > > The remaining issue with accessing per-cpu areas at multiple virtual > > > addresses is D-cache aliasing.
> > But that is not an issue for physically mapped caches.
> Right but I'd like to use this on sparc64 which has L1 D-cache > aliasing on some chips :-)
Hmmm... re my message I just sent. Then we have to return the memory with the virtual address not with the physical address on sparc. May result in zones with holes though.
From: Christoph Lameter <clame...@sgi.com> Date: Thu, 1 Nov 2007 15:11:41 -0700 (PDT)
> On Thu, 1 Nov 2007, David Miller wrote:
> > The remaining issue with accessing per-cpu areas at multiple virtual > > addresses is D-cache aliasing.
> But that is not an issue for physically mapped caches.
Right but I'd like to use this on sparc64 which has L1 D-cache aliasing on some chips :-)
On Thu, 1 Nov 2007, David Miller wrote: > From: Christoph Lameter <clame...@sgi.com> > Date: Thu, 1 Nov 2007 06:03:44 -0700 (PDT)
> > In order to make it truly dynamic we would have to virtually map the > > area. vmap? But that reduces performance.
> But it would still be faster than the double-indirection we do now, > right?
I think I have an idea how to do this. It's a bit x86_64 specific but here it goes.
We define a virtual area of NR_CPUS * 2M areas that are each mapped by a PMD. That means we have a fixed virtual address for each cpu's per cpu area.
First cpu is at PER_CPU_START
Second cpu is at PER_CPU_START + 2M
So the per cpu area for cpu n is easily calculated using
PER_CPU_START + (cpu << 21)
without any lookups.
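As a sanity check on that calculation: a 2M stride corresponds to a shift of 21 bits (2^21 = 2097152), and in C the shift operator binds less tightly than addition, so the shift has to be parenthesized. A minimal sketch follows, in which the base address, `NR_CPUS` value and the helper `per_cpu_area()` are placeholders chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define PER_CPU_START UINT64_C(0xffffe20000000000)  /* hypothetical base address */
#define PER_CPU_SHIFT 21                            /* 2M = 1 << 21 */
#define NR_CPUS       4096                          /* hypothetical cpu limit */

/* Virtual address of the per cpu area for `cpu`, computed without any lookup. */
static uint64_t per_cpu_area(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    return PER_CPU_START + ((uint64_t)cpu << PER_CPU_SHIFT);
}
```

Here `per_cpu_area(1) - per_cpu_area(0)` is exactly 2M, matching the layout listed above.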
On bootup we allocate the 2M pages.
After boot is complete we allow the reduction of the size of the per cpu areas. Let's say we only need 128k per cpu. Then the remaining pages will be returned to the page allocator.
We create some sysfs thingy where one can see the current reserves of per cpu storage. If one wants to reduce memory then one can write something to that to return the remainder of the memory.
From: Christoph Lameter <clame...@sgi.com> Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
> After boot is complete we allow the reduction of the size of the per cpu > areas . Lets say we only need 128k per cpu. Then the remaining pages will > be returned to the page allocator.
You don't know how much you will need. I exhausted the limit on sparc64 very late in the boot process when the last few userland services were starting up.
And if I subsequently bring up 100,000 IP tunnels, it will exhaust the per-cpu allocation area.
You have to make it fully dynamic, there is no way around it.
On Thu, 1 Nov 2007, David Miller wrote: > From: Christoph Lameter <clame...@sgi.com> > Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
> > After boot is complete we allow the reduction of the size of the per cpu > > areas . Lets say we only need 128k per cpu. Then the remaining pages will > > be returned to the page allocator.
> You don't know how much you will need. I exhausted the limit on > sparc64 very late in the boot process when the last few userland > services were starting up.
Well you would be able to specify how much will remain. If not it will just keep the 2M reserve around.
> And if I subsequently bring up 100,000 IP tunnels, it will exhaust the > per-cpu allocation area.
Each tunnel needs 4 bytes per cpu?
> You have to make it fully dynamic, there is no way around it.
Na. Some reasonable upper limit needs to be set. If we set that to, say, 32 megabytes and do the virtual mapping then we can just populate the first 2M and only allocate the remainder if we need it. Then we need to rely on Mel's defrag stuff, though, to defrag memory if we need it.
> > From: Christoph Lameter <clame...@sgi.com> > > Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
> > > After boot is complete we allow the reduction of the size of the per cpu > > > areas . Lets say we only need 128k per cpu. Then the remaining pages will > > > be returned to the page allocator.
> > You don't know how much you will need. I exhausted the limit on > > sparc64 very late in the boot process when the last few userland > > services were starting up.
> Well you would be able to specify how much will remain. If not it will > just keep the 2M reserve around.
> > And if I subsequently bring up 100,000 IP tunnels, it will exhaust the > > per-cpu allocation area.
> Each tunnel needs 4 bytes per cpu?
Each IP compression tunnel instance does an alloc_percpu().
Since you're the one who wants to change the semantics and guarantees of this interface, perhaps it might help if you did some greps around the tree to see how alloc_percpu() is actually used. That's what I did when I started running into trouble with your patches.
You cannot put limits on the amount of alloc_percpu() memory available to clients, please let's proceed with that basic understanding in mind. We're wasting a ton of time discussing this fundamental issue.
>> From: Christoph Lameter <clame...@sgi.com> >> Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
>>> After boot is complete we allow the reduction of the size of the per cpu >>> areas . Lets say we only need 128k per cpu. Then the remaining pages will >>> be returned to the page allocator. >> You don't know how much you will need. I exhausted the limit on >> sparc64 very late in the boot process when the last few userland >> services were starting up.
> Well you would be able to specify how much will remain. If not it will > just keep the 2M reserve around.
>> And if I subsequently bring up 100,000 IP tunnels, it will exhaust the >> per-cpu allocation area.
> Each tunnel needs 4 bytes per cpu?
well, if we move last_rx to a percpu var, we need 8 bytes of percpu space per net_device :)
>> You have to make it fully dynamic, there is no way around it.
> Na. Some reasonable upper limit needs to be set. If we set that to say > 32Megabytes and do the virtual mapping then we can just populate the first > 2M and only allocate the remainder if we need it. Then we need to rely on > Mel's defrag stuff though defrag memory if we need it.
If a 2MB page is not available, could we revert using 4KB pages ? (like vmalloc stuff), paying an extra runtime overhead of course.
On Fri, 2 Nov 2007, Eric Dumazet wrote: > > Na. Some reasonable upper limit needs to be set. If we set that to say > > 32Megabytes and do the virtual mapping then we can just populate the first > > 2M and only allocate the remainder if we need it. Then we need to rely on > > Mel's defrag stuff though defrag memory if we need it.
> If a 2MB page is not available, could we revert using 4KB pages ? (like > vmalloc stuff), paying an extra runtime overhead of course.
Sure. It's going to be like vmemmap. There will be limits imposed, though, by the amount of virtual space available. Basically the dynamic per cpu area can be at maximum
On Thu, 1 Nov 2007, David Miller wrote: > You cannot put limits of the amount of alloc_percpu() memory available > to clients, please let's proceed with that basic understanding in > mind. We're wasting a ton of time discussing this fundamental issue.
There is no point in making absolute demands like "no limits". There are always limits to everything.
A new implementation avoids the need to allocate per cpu arrays and also avoids the 32 bytes per object times cpus that are mostly wasted for small allocations today. So it's going to potentially allow more per cpu objects than are available today.
A reasonable implementation for 64 bit is likely going to depend on reserving some virtual memory space for the per cpu mappings so that they can be dynamically grown up to what the reserved virtual space allows.
F.e. If we reserve 256G of virtual space and support a maximum of 16k cpus then there is a limit on the per cpu space available of 16MB.
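The limit follows from dividing the reserved virtual space evenly among the possible cpus. With both sizes expressed as powers of two, 256G is 2^38 bytes and 16k cpus is 2^14, so each cpu gets 2^(38-14) = 2^24 bytes = 16MB. A small sketch of that arithmetic, where `per_cpu_limit()` is a hypothetical helper and not an existing kernel function:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of per cpu space available when a region of 2^virt_bits bytes is
 * split evenly among 2^cpu_bits cpus. */
static uint64_t per_cpu_limit(unsigned int virt_bits, unsigned int cpu_bits)
{
    assert(cpu_bits <= virt_bits && virt_bits < 64);
    return UINT64_C(1) << (virt_bits - cpu_bits);
}
```

The same helper gives 128MB for a 43-bit region split among 2^16 cpus and 512MB for a 43-bit region split among 2^14 (16k) cpus.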
Hmmm... On x86_64 we could take 8 terabyte virtual space (bit order 43)
With the worst case scenario of 16k of cpus (bit order 16) we are looking at 43-16 = 27 ~ 128MB per cpu. Each percpu can at max be mapped by 64 pmd entries. 4k support is actually max for projected hw. So we'd get to 512M.
On IA64 we could take half of the vmemmap area which is 45 bits. So we could get up to 512MB (with 16k pages, 64k pages can get us even further) assuming we can at some point run 16 processors per node (4k is the current max which would put the limit on the per cpu area >1GB).
Lets say you have a system with 64 cpus and an area of 128M of per cpu storage. Then we are using 8GB of total memory for per cpu storage. The 128M allows us to store f.e. 16 M of word size counters.
With SLAB and the current allocpercpu you would need the following for 16M counters:
16M*32*64 (minimum alloc size of SLAB is 32 byte and we alloc via kmalloc) for the data.
16M*64*8 for the pointer arrays. 16M allocpercpu areas for 64 processors and a pointer size of 8 bytes.
So you would need to use 40G in current systems. The new scheme would only need 8GB for the same amount of counters.
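The 40G versus 8GB comparison can be reproduced with a few lines of arithmetic. The sketch below only restates the figures from the message (32-byte minimum SLAB object, 8-byte pointers, 64 cpus, 16M word-sized counters); the function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SLAB_MIN_ALLOC 32   /* minimum kmalloc object size quoted above */
#define PTR_SIZE       8    /* pointer size on a 64 bit system */

/* Current scheme: one minimum-size SLAB object per cpu per counter, plus a
 * per-counter array holding one pointer per cpu. */
static uint64_t old_scheme_bytes(uint64_t objects, uint64_t cpus)
{
    uint64_t data, ptrs;

    assert(objects > 0 && cpus > 0);
    data = objects * SLAB_MIN_ALLOC * cpus;
    ptrs = objects * cpus * PTR_SIZE;
    return data + ptrs;
}

/* Proposed scheme: each cpu's area holds exactly obj_size bytes per counter. */
static uint64_t new_scheme_bytes(uint64_t objects, uint64_t obj_size,
                                 uint64_t cpus)
{
    assert(objects > 0 && obj_size > 0 && cpus > 0);
    return objects * obj_size * cpus;
}
```

For 16M counters on 64 cpus, `old_scheme_bytes()` yields 32G of data plus 8G of pointer arrays, 40G in total, while `new_scheme_bytes()` with 8-byte counters yields 8G.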
So I think it's unreasonable to assume that systems currently exist that can use more than 128M of allocpercpu space (assuming 64 cpus).
From: Christoph Lameter <clame...@sgi.com> Date: Thu, 1 Nov 2007 18:06:17 -0700 (PDT)
> A reasonable implementation for 64 bit is likely going to depend on > reserving some virtual memory space for the per cpu mappings so that they > can be dynamically grown up to what the reserved virtual space allows.
> F.e. If we reserve 256G of virtual space and support a maximum of 16k cpus > then there is a limit on the per cpu space available of 16MB.
Now that I understand your implementation better, yes this sounds just fine.
On Thu, 2007-11-01 at 15:58 -0700, David Miller wrote: > Since you're the one who wants to change the semantics and guarantees > of this interface, perhaps it might help if you did some greps around > the tree to see how alloc_percpu() is actually used. That's what > I did when I started running into trouble with your patches.
This fancy new BDI stuff also lives off percpu_counter/alloc_percpu().
That means that, for example, each NFS mount also consumes a number of words - not quite sure off the top of my head how many, might be on the order of 24 bytes or something.
I once before started looking at this, because the current alloc_percpu() can have some false sharing - not that I have machines that are overly bothered by that. I like the idea of a strict percpu region, however do be aware of the users.
On Fri, 2 Nov 2007, Peter Zijlstra wrote: > On Thu, 2007-11-01 at 15:58 -0700, David Miller wrote:
> > Since you're the one who wants to change the semantics and guarantees > > of this interface, perhaps it might help if you did some greps around > > the tree to see how alloc_percpu() is actually used. That's what > > I did when I started running into trouble with your patches.
> This fancy new BDI stuff also lives off percpu_counter/alloc_percpu().
Yes there are numerous uses. I can even increase page allocator performance and reduce its memory footprint by using it here.
> That means that for example each NFS mount also consumes a number of > words - not quite sure from the top of my head how many, might be in the > order of 24 bytes or something.
> I once before started looking at this, because the current > alloc_percpu() can have some false sharing - not that I have machines > that are overly bothered by that. I like the idea of a strict percpu > region, however do be aware of the users.
Well I wonder if I should introduce it not as a replacement but as an alternative to allocpercpu? We can then gradually switch over. The existing API does not allow the specification of gfp_masks or alignments.