This patch increases the speed of the SLUB fastpath by improving the per cpu allocator and makes it usable for SLUB.
Currently allocpercpu manages arrays of pointers to per cpu objects. This means that it has to allocate the arrays and then populate them as needed with objects. Although these objects are called per cpu objects they cannot be handled in the same way as regular per cpu variables, that is, by adding the per cpu offset of the respective cpu.
The patch here changes that. We create a small memory pool in the percpu area and allocate from there when alloc_percpu is called. As a result we do not need the per cpu pointer arrays for each object. This reduces memory usage and also the cache footprint of allocpercpu users. Also the per cpu objects for a single processor are tightly packed next to each other, decreasing cache footprint even further and making it possible to access multiple objects in the same cacheline.
SLUB has the same mechanism implemented. After fixing up the allocpercpu code we throw the SLUB method out and use the new allocpercpu handling. Then we optimize allocpercpu addressing by adding a new function
this_cpu_ptr()
that allows the determination of the per cpu pointer for the current processor in a more efficient way on many platforms.
This increases the speed of SLUB (and likely other kernel subsystems that benefit from the allocpercpu enhancements):
Note: The per cpu optimizations are only half way there because of the screwed up way that x86_64 handles its cpu area, which causes additional cycles to be spent retrieving a pointer from memory and adding it to the address. The i386 code is much less cycle intensive, being able to get to per cpu data using a segment prefix. If we can get that to work on x86_64 then we may be able to get the cycle count for the fastpath down to 20-30 cycles.
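To make the difference between the two addressing schemes concrete, here is a minimal userspace sketch. It is not kernel code: `DEMO_NR_CPUS`, `DEMO_AREA_SIZE`, `demo_cpu_area`, `demo_per_cpu_offset()`, `old_percpu_ptr()` and `new_percpu_ptr()` are hypothetical names made up for illustration, and the "per cpu areas" are just rows of a static array. The old lookup has to load a pointer from a per-object array, while the new lookup is plain address arithmetic on an offset.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS   4     /* hypothetical cpu count for the demo */
#define DEMO_AREA_SIZE 256   /* hypothetical per cpu area size in bytes */

/* Old scheme: every allocation carries its own array of pointers. */
struct old_percpu_data {
	void *ptrs[DEMO_NR_CPUS];
};

/* New scheme: one contiguous area per cpu, shared by all allocations. */
static unsigned long long
demo_cpu_area[DEMO_NR_CPUS][DEMO_AREA_SIZE / sizeof(unsigned long long)];

/* Stand-in for per_cpu_offset(cpu): distance from the area of cpu 0. */
size_t demo_per_cpu_offset(int cpu)
{
	return (size_t)cpu * DEMO_AREA_SIZE;
}

/* Old lookup: one extra memory load from the pointer array. */
void *old_percpu_ptr(struct old_percpu_data *pdata, int cpu)
{
	return pdata->ptrs[cpu];
}

/* New lookup: object offset plus cpu offset, no pointer array involved. */
void *new_percpu_ptr(size_t obj_offset, int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS && obj_offset < DEMO_AREA_SIZE);
	return (unsigned char *)demo_cpu_area + demo_per_cpu_offset(cpu)
		+ obj_offset;
}
```

In this sketch two objects at offsets 16 and 24 sit next to each other in every cpu's area, which is the packing effect the cover letter describes.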
[ slub_alloc 7K ]
Using allocpercpu removes the need for the per cpu arrays in the kmem_cache struct. These could get quite big if we have to support systems with up to thousands of cpus. The use of alloc_percpu means that:
1. The size of kmem_cache for SMP configuration shrinks since we will only need 1 pointer instead of NR_CPUS. The same pointer can be used by all processors. Reduces cache footprint of the allocator.
2. We can dynamically size kmem_cache according to the actual nodes in the system meaning less memory overhead for configurations that may potentially support up to 1k NUMA nodes.
3. We can remove the diddle widdle with allocating and releasing kmem_cache_cpu structures when bringing up and shutting down cpus. The allocpercpu logic will do it all for us.
4. Fastpath performance increases by another 20% vs. the earlier improvements. Instead of having fastpath with 40-50 cycles we are now in the 30-40 range.
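Point 1 can be illustrated with a small size comparison. This is only a sketch: `struct demo_cache_old`, `struct demo_cache_new` and `DEMO_NR_CPUS` are hypothetical, heavily cut-down stand-ins for kmem_cache and NR_CPUS, and the value 1024 is an assumed build-time limit.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS 1024   /* hypothetical build-time cpu limit */

struct demo_kmem_cache_cpu;  /* opaque in this sketch */

/* Before: one pointer slot per possible cpu embedded in the cache. */
struct demo_cache_old {
	unsigned long flags;
	int size;
	struct demo_kmem_cache_cpu *cpu_slab[DEMO_NR_CPUS];
};

/* After: a single per cpu pointer that all processors share. */
struct demo_cache_new {
	unsigned long flags;
	int size;
	struct demo_kmem_cache_cpu *cpu_slab;
};

/* Bytes saved per cache by dropping the per cpu pointer array. */
size_t demo_bytes_saved(void)
{
	assert(sizeof(struct demo_cache_old) > sizeof(struct demo_cache_new));
	return sizeof(struct demo_cache_old) - sizeof(struct demo_cache_new);
}
```

With these assumed definitions the saving is `DEMO_NR_CPUS - 1` pointers per cache, so the effect grows with the configured cpu limit rather than with the number of cpus actually present.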
Signed-off-by: Christoph Lameter <clame...@sgi.com>
 #ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu,
-				kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static cpumask_t kmem_cach_cpu_free_init_once = CPU_MASK_NONE;
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
-							int cpu, gfp_t flags)
-{
-	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
-	if (c)
-		per_cpu(kmem_cache_cpu_free, cpu) =
-				(void *)c->freelist;
-	else {
-		/* Table overflow: So allocate ourselves */
-		c = kmalloc_node(
-			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
-			flags, cpu_to_node(cpu));
-		if (!c)
-			return NULL;
-	}
-
-	init_kmem_cache_cpu(s, c);
-	return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
-	if (c < per_cpu(kmem_cache_cpu, cpu) ||
-			c > per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
-		kfree(c);
-		return;
-	}
-	c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
-	per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
 static void free_kmem_cache_cpus(struct kmem_cache *s)
 {
-	int cpu;
-
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c) {
-			s->cpu_slab[cpu] = NULL;
-			free_kmem_cache_cpu(c, cpu);
-		}
-	}
+	percpu_free(s->cpu_slab);
 }

 static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
 	int cpu;
[ slub_reduce 5K ]
Remove the fields in kmem_cache_cpu that were used to cache data from kmem_cache when they were in different cachelines. The cacheline that holds the per cpu array pointer now also holds these values. We can cut down the kmem_cache_cpu size to almost half.
The get_freepointer() and set_freepointer() functions that used to be only intended for the slow path now are also useful for the hot path since access to the field does not require an additional cacheline anymore. This results in consistent use of setting the freepointer for objects throughout SLUB.
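The free pointer accessors can be tried out in userspace with a minimal sketch like the one below. `struct demo_cache` and the `demo_*` functions are hypothetical stand-ins that keep only the `offset` field, and the list is terminated with NULL for simplicity, whereas the series uses a specially marked end pointer. The buffer passed in is assumed to be pointer-aligned and the object size a multiple of the pointer size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down cache descriptor: only the free pointer offset. */
struct demo_cache {
	size_t offset;   /* byte offset of the free pointer inside an object */
};

/* Read the "next free object" link stored inside a free object. */
void *demo_get_freepointer(const struct demo_cache *s, void *object)
{
	return *(void **)((char *)object + s->offset);
}

/* Store the "next free object" link inside a free object. */
void demo_set_freepointer(const struct demo_cache *s, void *object, void *fp)
{
	*(void **)((char *)object + s->offset) = fp;
}

/*
 * Chain 'count' objects of 'size' bytes each, laid out back to back in
 * 'buf', into a freelist. Returns the list head, or NULL if count is 0.
 */
void *demo_build_freelist(const struct demo_cache *s, void *buf,
			  size_t size, size_t count)
{
	char *base = buf;
	size_t i;

	assert(size >= s->offset + sizeof(void *));
	for (i = 0; i < count; i++) {
		void *next = (i + 1 < count) ? base + (i + 1) * size : NULL;

		demo_set_freepointer(s, base + i * size, next);
	}
	return count ? base : NULL;
}
```

Because every access goes through the same two helpers, the fast path and the slow path agree on where the link lives, which is the consistency the description refers to.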
Signed-off-by: Christoph Lameter <clame...@sgi.com>
-/*
- * Slow version of get and set free pointer.
- *
- * This version requires touching the cache lines of kmem_cache which
- * we avoid to do in the fast alloc free paths. There we obtain the offset
- * from the page struct.
- */
 static inline void *get_freepointer(struct kmem_cache *s, void *object)
 {
 	return *(void **)(object + s->offset);
@@ -1446,10 +1439,10 @@ static void deactivate_slab(struct kmem_

 static void init_kmem_cache_node(struct kmem_cache_node *n)
@@ -3027,21 +3018,12 @@ struct kmem_cache *kmem_cache_create(con
 	down_write(&slub_lock);
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int cpu;
-
 		s->refcount++;
 		/*
 		 * Adjust the object sizes so that we clear
 		 * the complete object on kzalloc.
 		 */
 		s->objsize = max(s->objsize, (int)size);
-
-		/*
-		 * And then we need to update the object size in the
-		 * per cpu structures
-		 */
-		for_each_online_cpu(cpu)
-			get_cpu_slab(s, cpu)->objsize = s->objsize;
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
 		if (sysfs_slab_alias(s, name))
+static inline struct kmem_cache_cpu *this_cpu_slab(struct kmem_cache *s)
+{
+#ifdef CONFIG_SMP
+	return this_cpu_ptr(s->cpu_slab);
+#else
+	return &s->cpu_slab;
+#endif
+}
+
 /*
  * The end pointer in a slab is special. It points to the first object in the
  * slab but has bit 0 set to mark it.
@@ -1521,7 +1530,7 @@ static noinline unsigned long get_new_sl
 	if (!page)
 		return 0;

-	*pc = c = get_cpu_slab(s, smp_processor_id());
+	*pc = c = this_cpu_slab(s);
 	if (c->page) {
 		/*
 		 * Someone else populated the cpu_slab while we
@@ -1650,25 +1659,26 @@ static void __always_inline *slab_alloc(
 	struct kmem_cache_cpu *c;

 #ifdef CONFIG_FAST_CMPXCHG_LOCAL
-	c = get_cpu_slab(s, get_cpu());
+	preempt_disable();
+	c = this_cpu_slab(s);
 	do {
 		object = c->freelist;
 		if (unlikely(is_end(object) || !node_match(c, node))) {
 			object = __slab_alloc(s, gfpflags, node, addr, c);
 			if (unlikely(!object)) {
-				put_cpu();
+				preempt_enable();
 				goto out;
 			}
 			break;
 		}
 	} while (cmpxchg_local(&c->freelist, object, get_freepointer(s, object)) != object);
-	put_cpu();
+	preempt_enable();
 #else
 	unsigned long flags;

 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
+	c = this_cpu_slab(s);
 	if (unlikely((is_end(c->freelist)) || !node_match(c, node))) {
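The retry loop around cmpxchg_local() in the hunk above can be approximated in userspace with C11 atomics. This is a sketch under stated assumptions, not the kernel fast path: `struct demo_object`, `struct demo_cpu_slab`, `demo_fast_alloc()` and `demo_fast_free()` are hypothetical, an empty list is marked with NULL instead of the special end pointer, and `atomic_compare_exchange_weak()` is a full cross-thread atomic where cmpxchg_local() only has to be atomic with respect to the local cpu. The kernel code also relies on the list being per cpu with preemption disabled; a pop like this one shared between real threads would be exposed to the ABA problem.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical object layout: the free pointer lives at offset 0. */
struct demo_object {
	struct demo_object *next_free;
};

/* Hypothetical per cpu slab state: just the freelist head. */
struct demo_cpu_slab {
	_Atomic(struct demo_object *) freelist;
};

/*
 * Pop one object. If the head changed between the read and the swap,
 * the compare-exchange fails, reloads 'object' and we try again.
 * Returns NULL when the list is empty (where the real code would
 * fall back to its slow path).
 */
struct demo_object *demo_fast_alloc(struct demo_cpu_slab *c)
{
	struct demo_object *object = atomic_load(&c->freelist);

	while (object &&
	       !atomic_compare_exchange_weak(&c->freelist, &object,
					     object->next_free))
		;	/* 'object' now holds the current head, retry */
	return object;
}

/* Push an object back onto the head of the list. */
void demo_fast_free(struct demo_cpu_slab *c, struct demo_object *object)
{
	assert(object != NULL);
	object->next_free = atomic_load(&c->freelist);
	while (!atomic_compare_exchange_weak(&c->freelist,
					     &object->next_free, object))
		;	/* next_free was refreshed with the current head */
}
```

Objects come back in last-in, first-out order, so the most recently freed (and likely cache-hot) object is handed out first.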
[ newallocpercpu 10K ]
Currently each call to alloc_percpu allocates an array of pointers to objects. For each operation on a percpu structure we need to follow a pointer from that map. Usually a processor uses only the entry for its own processor id in that array. The rest of the bytes in the cacheline are not needed. This repeats itself for each and every per cpu array in use.
Moreover the result of alloc_percpu is not a variable that can be handled like a regular per cpu variable.
The approach here changes the way allocpercpu is done. Objects are placed in preallocated per cpu areas that are indexed via the existing per cpu array of pointers. So we have a single array of pointers to per cpu areas that is used by all per cpu operations. The data is placed tightly next to each other for each processor so that the likelihood of a single cache line covering data for multiple needs is increased. The cache footprint of the allocpercpu operations shrinks dramatically. Some processors have the ability to map the per cpu area of the current processor in a special way so that variables in that area can be reached very efficiently. It is rather typical that a processor only uses its own per processor area. On many architectures the indexing via the per cpu array can then be completely bypassed.
The size of the per cpu alloc area is defined to be 32k per processor for now.
Another advantage of this approach is that the onlining and offlining of the per cpu items is handled in a global way. On onlining a cpu all objects become present without callbacks. Similarly on offlining a cpu all per cpu objects vanish without the need of callbacks. Callbacks may still be needed to do preparation and cleanup of the data areas but the freeing and allocation of the per cpu areas no longer needs to be done by the subsystems.
Signed-off-by: Christoph Lameter <clame...@sgi.com>
Index: linux-2.6/mm/allocpercpu.c
===================================================================
--- linux-2.6.orig/mm/allocpercpu.c	2007-10-31 16:39:13.584621383 -0700
+++ linux-2.6/mm/allocpercpu.c	2007-10-31 16:39:15.924121250 -0700
@@ -2,10 +2,140 @@
  * linux/mm/allocpercpu.c
  *
  * Separated from slab.c August 11, 2006 Christoph Lameter <clame...@sgi.com>
+ *
+ * (C) 2007 SGI, Christoph Lameter <clame...@sgi.com>
+ *	Basic implementation with allocation and free from a dedicated per
+ *	cpu area.
+ *
+ * The per cpu allocator allows allocation of memory from a statically
+ * allocated per cpu array and consists of cells of UNIT_SIZE. A byte array
+ * is used to describe the state of each of the available units that can be
+ * allocated via cpu_alloc() and freed via cpu_free(). The possible states are:
+ *
+ * FREE = The per cpu unit is not allocated
+ * USED = The per cpu unit is allocated and more units follow.
+ * END  = The last per cpu unit used for an allocation (needed to
+ *		establish the size of the allocation on free)
+ *
+ * The per cpu allocator is typically used to allocate small sized object from 8 to 32
+ * bytes and it is rarely used. Allocation is looking for the first available object
+ * in the cpu_alloc_map. If the allocator would be used frequently with varying sizes
+ * of objects then we may end up with fragmentation.
  */
 #include <linux/mm.h>
 #include <linux/module.h>

+/*
+ * Maximum allowed per cpu data per cpu
+ */
+#define PER_CPU_ALLOC_SIZE 32768
+
+#define UNIT_SIZE sizeof(unsigned long long)
+#define UNITS_PER_CPU (PER_CPU_ALLOC_SIZE / UNIT_SIZE)
+
+enum unit_type { FREE, END, USED };
+
+static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
+static DEFINE_SPINLOCK(cpu_alloc_map_lock);
+static DEFINE_PER_CPU(int, cpu_area)[UNITS_PER_CPU];
+
+#define CPU_DATA_OFFSET ((unsigned long)&per_cpu__cpu_area)
+
+/*
+ * How many units are needed for an object of a given size
+ */
+static int size_to_units(unsigned long size)
+{
+	return DIV_ROUND_UP(size, UNIT_SIZE);
+}
+
+/*
+ * Mark an object as used in the cpu_alloc_map
+ *
+ * Must hold cpu_alloc_map_lock
+ */
+static void set_map(int start, int length)
+{
+	cpu_alloc_map[start + length - 1] = END;
+	if (length > 1)
+		memset(cpu_alloc_map + start, USED, length - 1);
+}
+
+/*
+ * Mark an area as freed.
+ *
+ * Must hold cpu_alloc_map_lock
+ *
+ * Return the number of units taken up by the object freed.
+ */
+static int clear_map(int start)
+{
+	int units = 0;
+
+	while (cpu_alloc_map[start + units] == USED) {
+		cpu_alloc_map[start + units] = FREE;
+		units++;
+	}
+	BUG_ON(cpu_alloc_map[start] != END);
+	cpu_alloc_map[start] = FREE;
+	return units + 1;
+}
+
+/*
+ * Allocate an object of a certain size
+ *
+ * Returns a per cpu pointer that must not be directly used.
+ */
+static void *cpu_alloc(unsigned long size)
+{
+	unsigned long start = 0;
+	int units = size_to_units(size);
+	unsigned end;
+
+	spin_lock(&cpu_alloc_map_lock);
+	do {
+		while (start < UNITS_PER_CPU &&
+				cpu_alloc_map[start] != FREE)
+			start++;
+		if (start == UNITS_PER_CPU)
+			return NULL;
+
+		end = start + 1;
+		while (end < UNITS_PER_CPU && end - start < units &&
+				cpu_alloc_map[end] == FREE)
+			end++;
+		if (end - start == units)
+			break;
+		start = end;
+	} while (1);
+
+	set_map(start, units);
+	__count_vm_events(ALLOC_PERCPU, units * UNIT_SIZE);
+	spin_unlock(&cpu_alloc_map_lock);
+	return (void *)(start * UNIT_SIZE + CPU_DATA_OFFSET);
+}
+
+/*
+ * Free an object. The pointer must be a per cpu pointer allocated
+ * via cpu_alloc.
+ */
+static inline void cpu_free(void *pcpu)
+{
+	unsigned long start = (unsigned long)pcpu;
+	int index;
+	int units;
+
+	BUG_ON(start < CPU_DATA_OFFSET);
+	index = (start - CPU_DATA_OFFSET) / UNIT_SIZE;
+	BUG_ON(cpu_alloc_map[index] == FREE ||
+			index >= UNITS_PER_CPU);
+
+	spin_lock(&cpu_alloc_map_lock);
+	units = clear_map(index);
+	__count_vm_events(ALLOC_PERCPU, -units * UNIT_SIZE);
+	spin_unlock(&cpu_alloc_map_lock);
+}
+
 /**
  * percpu_depopulate - depopulate per-cpu data for given cpu
  * @__pdata: per-cpu data to depopulate
@@ -16,10 +146,10 @@
  */
 void percpu_depopulate(void *__pdata, int cpu)
 {
-	struct percpu_data *pdata = __percpu_disguise(__pdata);
-
-	kfree(pdata->ptrs[cpu]);
-	pdata->ptrs[cpu] = NULL;
+	/*
+	 * Nothing to do here. Removal can only be effected for all
+	 * per cpu areas of a cpu at once.
+	 */
 }
 EXPORT_SYMBOL_GPL(percpu_depopulate);
-#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
 /*
  * Use this to get to a cpu's version of the per-cpu object dynamically
  * allocated. Non-atomic access to the current CPU's version should
  * probably be combined with get_cpu()/put_cpu().
  */
-#define percpu_ptr(ptr, cpu) \
-({ \
-	struct percpu_data *__p = __percpu_disguise(ptr); \
-	(__typeof__(ptr))__p->ptrs[(cpu)]; \
+#define percpu_ptr(ptr, cpu) \
+({ \
+	void *p = __percpu_disguise(ptr); \
+	unsigned long q = per_cpu_offset(cpu); \
+	(__typeof__(ptr))(p + q); \
 })
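The FREE/USED/END unit map described in the allocpercpu.c header comment can be exercised in userspace with a sketch like the following. It is a simplified model, not the patch itself: the `demo_*` names and the 64-unit map are hypothetical, locking and the vmstat accounting are left out, indices are returned instead of per cpu pointers, and unit 0 is not reserved. The free routine checks the END marker at `start + units`, that is, at the last unit of the allocation being released.

```c
#include <assert.h>
#include <string.h>

#define DEMO_UNITS 64   /* hypothetical: tiny map, the patch uses 32768 / 8 */

enum demo_unit { DEMO_FREE, DEMO_END, DEMO_USED };

static unsigned char demo_map[DEMO_UNITS];

/* Mark [start, start + length) as one allocation: USED ... USED END. */
static void demo_set_map(int start, int length)
{
	demo_map[start + length - 1] = DEMO_END;
	if (length > 1)
		memset(demo_map + start, DEMO_USED, length - 1);
}

/* First-fit search; returns the first unit index, or -1 if nothing fits. */
int demo_alloc_units(int units)
{
	int start = 0;

	if (units <= 0)
		return -1;
	while (start + units <= DEMO_UNITS) {
		int run = 0;

		while (run < units && demo_map[start + run] == DEMO_FREE)
			run++;
		if (run == units) {
			demo_set_map(start, units);
			return start;
		}
		start += run + 1;	/* skip past the unit that was busy */
	}
	return -1;
}

/* Free the allocation that begins at 'start'; returns the units released. */
int demo_free_units(int start)
{
	int units = 0;

	while (demo_map[start + units] == DEMO_USED) {
		demo_map[start + units] = DEMO_FREE;
		units++;
	}
	assert(demo_map[start + units] == DEMO_END);
	demo_map[start + units] = DEMO_FREE;
	return units + 1;
}
```

Because only the END marker records where an allocation stops, the caller never has to pass a size to the free routine. The first-fit scan is linear in the map size, which matches the comment's expectation that the allocator is used rarely and for small objects.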
[ rm_old_function 8K ]
Population and depopulation is no longer needed since newly created per cpu areas will have all the fields needed. Teardown of per cpu areas will remove objects no longer needed.
This basically reverts the API to the way it was before the population and depopulation went in. There is only a single user in the kernel that uses these functions in net/iucv/iucv.c which is S/390 specific.
Remove the useless population and depopulation functions there. In that driver we have the single occurrence of a per cpu allocation that uses GFP flags. The allocation from the DMA zone is required in order to have memory below 2G. But it seems that the per cpu areas are also under 2G so we are fine there.
Signed-off-by: Christoph Lameter <clame...@sgi.com>
-/**
- * percpu_depopulate - depopulate per-cpu data for given cpu
- * @__pdata: per-cpu data to depopulate
- * @cpu: depopulate per-cpu data for this cpu
- *
- * Depopulating per-cpu data for a cpu going offline would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- */
-void percpu_depopulate(void *__pdata, int cpu)
-{
-	/*
-	 * Nothing to do here. Removal can only be effected for all
-	 * per cpu areas of a cpu at once.
-	 */
-}
-EXPORT_SYMBOL_GPL(percpu_depopulate);
-
-/**
- * percpu_depopulate_mask - depopulate per-cpu data for some cpu's
- * @__pdata: per-cpu data to depopulate
- * @mask: depopulate per-cpu data for cpu's selected through mask bits
- */
-void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask)
-{
-	/*
-	 * Nothing to do
-	 */
-}
-EXPORT_SYMBOL_GPL(__percpu_depopulate_mask);
-
-/**
- * percpu_populate - populate per-cpu data for given cpu
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @cpu: populate per-data for this cpu
- *
- * Populating per-cpu data for a cpu coming online would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- * Per-cpu object is populated with zeroed buffer.
+/*
+ * Allocate a per cpu array and zero all the per cpu objects.
+ * This is the externally visible function.
  */
-void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
-{
-	int pdata = (unsigned long)__percpu_disguise(__pdata);
-	void *p = (void *)per_cpu_offset(cpu) + pdata;
-
-	memset(p, 0, size);
-	return p;
-}
-EXPORT_SYMBOL_GPL(percpu_populate);
-
-/**
- * percpu_populate_mask - populate per-cpu data for more cpu's
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-cpu data for cpu's selected through mask bits
- *
- * Per-cpu objects are populated with zeroed buffers.
- */
-int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
-			   cpumask_t *mask)
-{
-	cpumask_t populated = CPU_MASK_NONE;
-	int cpu;
-
-	for_each_cpu_mask(cpu, *mask)
-		if (unlikely(!percpu_populate(__pdata, size, gfp, cpu))) {
-			__percpu_depopulate_mask(__pdata, &populated);
-			return -ENOMEM;
-		} else
-			cpu_set(cpu, populated);
-	return 0;
-}
-EXPORT_SYMBOL_GPL(__percpu_populate_mask);
-
-/**
- * percpu_alloc_mask - initial setup of per-cpu data
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-data for cpu's selected through mask bits
- *
- * Populating per-cpu data for all online cpu's would be a typical use case,
- * which is simplified by the percpu_alloc() wrapper.
- * Per-cpu objects are populated with zeroed buffers.
- */
-void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
+void *__alloc_percpu(size_t size)
 {
 	void *pdata = cpu_alloc(size);
 	void *__pdata = __percpu_disguise(pdata);
+	int cpu;
 /**
  * percpu_free - final cleanup of per-cpu data
  * @__pdata: object to clean up
- *
- * We simply clean up any per-cpu object left. No need for the client to
- * track and specify through a bis mask which per-cpu objects are to free.
  */
 void percpu_free(void *__pdata)
 {
Index: linux-2.6/net/iucv/iucv.c
===================================================================
--- linux-2.6.orig/net/iucv/iucv.c	2007-10-31 16:39:13.001121287 -0700
+++ linux-2.6/net/iucv/iucv.c	2007-10-31 16:40:14.892121256 -0700
@@ -556,25 +556,6 @@ static int __cpuinit iucv_cpu_notify(str
 	long cpu = (long) hcpu;

 	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		if (!percpu_populate(iucv_irq_data,
-				     sizeof(struct iucv_irq_data),
-				     GFP_KERNEL|GFP_DMA, cpu))
-			return NOTIFY_BAD;
-		if (!percpu_populate(iucv_param, sizeof(union iucv_param),
-				     GFP_KERNEL|GFP_DMA, cpu)) {
-			percpu_depopulate(iucv_irq_data, cpu);
-			return NOTIFY_BAD;
-		}
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		percpu_depopulate(iucv_param, cpu);
-		percpu_depopulate(iucv_irq_data, cpu);
-		break;
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 	case CPU_DOWN_FAILED:
@@ -1617,16 +1598,18 @@ static int __init iucv_init(void)
 		rc = PTR_ERR(iucv_root);
 		goto out_bus;
 	}
-	/* Note: GFP_DMA used to get memory below 2G */
-	iucv_irq_data = percpu_alloc(sizeof(struct iucv_irq_data),
-				     GFP_KERNEL|GFP_DMA);
+	/*
+	 * Note: GFP_DMA used to get memory below 2G.
+	 *
+	 * The percpu data is below 2G right ? So this should work too -cl?
+	 */
+	iucv_irq_data = percpu_alloc(struct iucv_irq_data);
 	if (!iucv_irq_data) {
 		rc = -ENOMEM;
 		goto out_root;
 	}
 	/* Allocate parameter blocks. */
-	iucv_param = percpu_alloc(sizeof(union iucv_param),
-				  GFP_KERNEL|GFP_DMA);
+	iucv_param = percpu_alloc(union iucv_param);
 	if (!iucv_param) {
 		rc = -ENOMEM;
 		goto out_extint;
From: Christoph Lameter <clame...@sgi.com> Date: Wed, 31 Oct 2007 17:53:23 -0700 (PDT)
> > This patch fixes build failures with DEBUG_VM disabled.
> Well there is more there. Last minute mods sigh. With DEBUG_VM you likely
> need this patch.
Without DEBUG_VM I get a loop of crashes shortly after SSHD is started, I'll try to track it down. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Christoph Lameter <clame...@sgi.com> Date: Wed, 31 Oct 2007 18:01:34 -0700 (PDT)
> On Wed, 31 Oct 2007, David Miller wrote:
> > Without DEBUG_VM I get a loop of crashes shortly after SSHD
> > is started, I'll try to track it down.
> Check how much per cpu memory is in use by
> cat /proc/vmstat
> currently we have a 32k limit there.
It crashes when SSHD starts, the serial console GETTY hasn't started up yet so I can't even log in to run those commands Christoph.
All I can do now is bisect and then try to figure out what about the guilty change might cause the problem.
This is on a 64-cpu sparc64 box, and fast cmpxchg local is not set, so maybe it's one of the locking changes.
From: Christoph Lameter <clame...@sgi.com> Date: Wed, 31 Oct 2007 18:12:11 -0700 (PDT)
> On Wed, 31 Oct 2007, David Miller wrote:
> > All I can do now is bisect and then try to figure out what about the
> > guilty change might cause the problem.
> Reverting the 7th patch should avoid using the sparc register that caches > the per cpu area offset? (I though so, does it?)
Yes, that's right, %g5 holds the local cpu's per-cpu offset.
On Wed, 31 Oct 2007, David Miller wrote:
> It crashes when SSHD starts, the serial console GETTY hasn't
> started up yet so I can't even log in to run those commands
> Christoph.
Hmmm... Bad.
> All I can do now is bisect and then try to figure out what about the
> guilty change might cause the problem.
Reverting the 7th patch should avoid using the sparc register that caches the per cpu area offset? (I though so, does it?)
-#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
+#define PERCPU_PAGE_SHIFT	20	/* log2() of max. size of per-CPU area */
 #define PERCPU_PAGE_SIZE	(__IA64_UL_CONST(1) << PERCPU_PAGE_SHIFT)

 /*
+ * This will make per cpu access to the local area use the virtually mapped
+ * areas.
+ */
+#define this_cpu_offset()	0
+
+/*
  * Pretty much a literal copy of asm-generic/percpu.h, except that percpu_modcopy() is an
  * external routine, to avoid include-hell.
  */
@@ -51,8 +57,6 @@ extern unsigned long __per_cpu_offset[NR
 /* Equal to __per_cpu_offset[smp_processor_id()], but faster to access: */
 DECLARE_PER_CPU(unsigned long, local_per_cpu_offset);
From: Christoph Lameter <clame...@sgi.com> Date: Wed, 31 Oct 2007 21:16:59 -0700 (PDT)
>  /*
>   * Maximum allowed per cpu data per cpu
>   */
> +#ifdef CONFIG_NUMA
> +#define PER_CPU_ALLOC_SIZE (32768 + MAX_NUMNODES * 512)
> +#else
>  #define PER_CPU_ALLOC_SIZE 32768
> +#endif
> +
Christoph, as Rusty found out years ago when he first wrote this code, you cannot put hard limits on the alloc_percpu() allocations.
They can be done by anyone, any module, and since there was no limit before you cannot reasonably add one now.
As just one of many examples, several networking devices use alloc_percpu() for each instance they bring up. This alone can request arbitrary amounts of per-cpu data.
Therefore, you'll need to do your optimization without imposing any size limits.
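A rough calculation shows how quickly a fixed area can run out. The sketch below takes the 32768 byte limit and the `unsigned long long` unit size from the patch; everything else, including the `demo_*` names and the 256 byte per-instance size used in the usage note, is a made-up assumption for illustration and not a measurement of any real driver.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PER_CPU_ALLOC_SIZE 32768u                 /* limit from the patch */
#define DEMO_UNIT_SIZE sizeof(unsigned long long)      /* unit size from the patch */

/* Units consumed by one allocation of 'size' bytes, rounded up. */
size_t demo_size_to_units(size_t size)
{
	return (size + DEMO_UNIT_SIZE - 1) / DEMO_UNIT_SIZE;
}

/*
 * Upper bound on how many allocations of 'size' bytes fit into the
 * fixed per cpu area, ignoring fragmentation and other users.
 */
size_t demo_max_allocations(size_t size)
{
	size_t units;

	assert(size > 0);
	units = demo_size_to_units(size);
	return (DEMO_PER_CPU_ALLOC_SIZE / DEMO_UNIT_SIZE) / units;
}
```

For example, if each device instance asked for a hypothetical 256 byte per cpu structure, `demo_max_allocations(256)` gives the number of instances after which further allocations would fail, even on a machine with plenty of free memory.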
This hunk helped the sparc64 looping OOPS I was getting, but cpus hang in some other fashion soon afterwards.
I'll try to debug this some more later, I've dumped enough time into this already :-)
> This patch increases the speed of the SLUB fastpath by
> improving the per cpu allocator and makes it usable for SLUB.

> Currently allocpercpu manages arrays of pointer to per cpu objects.
> This means that it has to allocate the arrays and then populate them
> as needed with objects. Although these objects are called per cpu
> objects they cannot be handled in the same way as per cpu objects
> by adding the per cpu offset of the respective cpu.

> The patch here changes that. We create a small memory pool in the
> percpu area and allocate from there if alloc per cpu is called.
> As a result we do not need the per cpu pointer arrays for each
> object. This reduces memory usage and also the cache foot print
> of allocpercpu users. Also the per cpu objects for a single processor
> are tightly packed next to each other decreasing cache footprint
> even further and making it possible to access multiple objects
> in the same cacheline.

> SLUB has the same mechanism implemented. After fixing up the
> alloccpu stuff we throw the SLUB method out and use the new
> allocpercpu handling. Then we optimize allocpercpu addressing
> by adding a new function

> this_cpu_ptr()

> that allows the determination of the per cpu pointer for the
> current processor in a more efficient way on many platforms.

> This increases the speed of SLUB (and likely other kernel subsystems
> that benefit from the allocpercpu enhancements):

> Note: The per cpu optimization are only half way there because of the screwed
> up way that x86_64 handles its cpu area that causes additional cycles to be
> spent by retrieving a pointer from memory and adding it to the address.
> The i386 code is much less cycle intensive being able to get to per cpu
> data using a segment prefix and if we can get that to work on x86_64
> then we may be able to get the cycle count for the fastpath down to 20-30
> cycles.
This sounds really good Christoph, and not only for SLUB. I guess the 32k limit will not be enough, because many more things would use per_cpu if only per_cpu was reasonably fast (i.e. not so many dereferences).
I think this question already came up in the past and Linus already answered it, but I ask it again: what about VM games with modern cpus (64 bit arches)?
Say we reserve a really huge (2^32 bytes) area on x86_64, and change the VM layout so that each cpu maps its own per_cpu area into this area, so that the local per_cpu data sits at the same virtual address on each cpu. Then we need neither a segment prefix nor the addition of a 'per_cpu offset'. There is no need to write special asm functions to read/write/increment per_cpu data, and gcc could use its normal rules for optimizations.
We would only need to add the "per_cpu offset" to get to the data of a given (other) cpu.