
What is the most efficient way to store a series of values with unknown size?

I have a function (say, named next_entity) that generates size_t values. The function acts as a generator: it produces a new value on each call and finally returns 0 as a sentinel.

In another function that calls next_entity, I need to store the values somewhere. I don't know the number of values before I receive the sentinel, so I cannot malloc or statically allocate an array to contain them in advance.

The thing is that after the sentinel comes, the only thing I need to do with the values is save them to a file, but with no repetitions; each value must occur only once.

After that, the values are no longer needed.

I tried to use GHashTable from glib.h during the iteration to store the values as keys, but the problem with GHashTable is that the key pointers passed to g_hash_table_insert must remain alive for the lifetime of the hash table, so I have to do a malloc(sizeof(size_t)) for each new value.

It works, but it seems to be quite inefficient, because malloc is time-consuming.

Is there any better way to do that?

I can post the real code if it is needed, but I don't think the problem is about the code.

Any help would be appreciated, thank you in advance!


If your data size is not gigabytes, you can make do with a dynamically allocated array that you double in size with realloc() every time you run out of space. With this strategy, only log(N) reallocations have to happen.

In C++, for example, many std::vector implementations do just that.
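
For illustration, a minimal sketch of the doubling strategy; next_entity is the generator from the question, while collect, cap and n are made-up names:

#include <stdlib.h>

size_t next_entity(void);               /* the generator from the question */

/* Collect all generated values into a growable array; returns the array
 * (caller frees) and stores the element count in *count. */
size_t *collect(size_t *count)
{
    size_t cap = 16, n = 0;
    size_t *arr = malloc(cap * sizeof *arr);
    if (arr == NULL)
        return NULL;

    for (size_t v; (v = next_entity()) != 0; ) {
        if (n == cap) {                 /* out of space: double the capacity */
            size_t *tmp = realloc(arr, 2 * cap * sizeof *arr);
            if (tmp == NULL) { free(arr); return NULL; }
            arr = tmp;
            cap *= 2;
        }
        arr[n++] = v;
    }
    *count = n;
    return arr;
}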


Others have suggested realloc. Instead of that, consider using a linked list of malloc'ed blocks, where each block contains a largish array of numbers; when you fill up a block, allocate another one and link the previous one to it ... or link it to the previous one and then reverse the list before outputting them. The point is that you don't need to store your values in contiguous memory, so you don't need to use realloc. realloc copies the previous array, which you can avoid and which will be slow for very large arrays, and it could even run out of memory if it can't find a large enough contiguous block, whereas allocating individual blocks is more resilient. The downside is that it takes more work to manage the linked list.
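
A rough sketch of that idea, assuming the block size and the struct/field names below (they are not from the question):

#include <stdlib.h>

#define BLOCK_VALS 4096                 /* values per block; pick to taste */

struct block {
    size_t vals[BLOCK_VALS];
    size_t used;
    struct block *next;
};

size_t next_entity(void);               /* the generator from the question */

/* Build the chain in arrival order by linking each full block to the new
 * one; returns the head of the list, or NULL on allocation failure. */
struct block *collect_blocks(void)
{
    struct block *head = calloc(1, sizeof *head);
    struct block *cur = head;
    if (head == NULL)
        return NULL;

    for (size_t v; (v = next_entity()) != 0; ) {
        if (cur->used == BLOCK_VALS) {
            cur->next = calloc(1, sizeof *cur); /* link previous block to the new one */
            cur = cur->next;
            if (cur == NULL)
                break;                  /* out of memory; head is still valid for cleanup */
        }
        cur->vals[cur->used++] = v;
    }
    return head;
}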

Edit, regarding GHashTable usage:

Store your value into the array and pass the address of that element to the hash routine ... if it's not already present, advance the array pointer. To output the values, just enumerate the keys from the hash table. The linked list of arrays is only needed for deallocating them. If this is all the program does, then you don't even need to maintain a linked list; you can just allocate arrays as you need them, and they will all get deallocated when your program exits.
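
A sketch of that combination, assuming GLib 2.32 or later (for g_hash_table_contains and g_hash_table_add); the helper names and the single fixed array are illustrative:

#include <glib.h>

size_t next_entity(void);               /* the generator from the question */

/* Hash/equality callbacks that look at the size_t a key points to. */
static guint size_ptr_hash(gconstpointer p)
{
    return g_direct_hash(GSIZE_TO_POINTER(*(const size_t *)p));
}

static gboolean size_ptr_equal(gconstpointer a, gconstpointer b)
{
    return *(const size_t *)a == *(const size_t *)b;
}

/* Fill vals (capacity max) with unique values; the keys in seen are
 * pointers into vals, so no per-value malloc is needed. */
static size_t collect_unique(size_t *vals, size_t max, GHashTable *seen)
{
    size_t used = 0;
    for (size_t v; used < max && (v = next_entity()) != 0; ) {
        vals[used] = v;                            /* tentatively store it */
        if (!g_hash_table_contains(seen, &vals[used])) {
            g_hash_table_add(seen, &vals[used]);   /* key points into the array */
            used++;                                /* keep the slot */
        }                                          /* duplicate: slot gets reused */
    }
    return used;
}

The table would be created with g_hash_table_new(size_ptr_hash, size_ptr_equal); to output the values you can either walk the array or enumerate the keys, as described above.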


The keys in a GHashTable are gpointer (a void *), and on the platforms GLib supports a pointer is at least as large as a size_t.

All you need to do is, instead of malloc(sizeof(size_t)) for each key, create the table with g_hash_table_new(NULL, NULL) so that g_direct_hash is used as the hash method. Then do this:

g_hash_table_replace(table, GSIZE_TO_POINTER(value), GSIZE_TO_POINTER(value))

To iterate over keys, use GPOINTER_TO_SIZE to get back to the size_t.

You can always do this for any integer type instead of malloc'ing it (use GINT_TO_POINTER or GUINT_TO_POINTER instead of GSIZE_TO_POINTER when appropriate).
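
Putting it together, a sketch of the whole collect-and-save step with that scheme (save_unique and out are made-up names; since the sentinel is 0, no stored value ever maps to NULL):

#include <glib.h>
#include <stdio.h>

size_t next_entity(void);               /* the generator from the question */

/* Collect the values in a GHashTable used as a set (g_direct_hash via
 * g_hash_table_new(NULL, NULL)), then write each unique value once. */
void save_unique(FILE *out)
{
    GHashTable *seen = g_hash_table_new(NULL, NULL);
    GHashTableIter it;
    gpointer key;

    for (size_t v; (v = next_entity()) != 0; )
        g_hash_table_replace(seen, GSIZE_TO_POINTER(v), GSIZE_TO_POINTER(v));

    g_hash_table_iter_init(&it, seen);
    while (g_hash_table_iter_next(&it, &key, NULL))
        fprintf(out, "%zu\n", (size_t)GPOINTER_TO_SIZE(key));

    g_hash_table_destroy(seen);
}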


The usual way is to multiply the space used by a constant (2, 1.69, 1.618, 1.5, ...).

I like the golden ratio :)

size_t elems = 8;                        /* initial capacity */
size_t *arr = malloc(elems * sizeof *arr), *tmparr;
if (arr == NULL) /* deal with error */;
{
    /* ... whenever the array fills up ... */
    elems = elems * 13 / 8;              /* approximate golden ratio */
    tmparr = realloc(arr, elems * sizeof *arr);
    if (tmparr == NULL) /* deal with error */;
    arr = tmparr;
    /* ... */
}
free(arr);
