Disk I/O & size performance



























I have a process that needs to do a lot of disk writes but no reads. I can either write a lot of small files (~1,000,000,000 files, which is what I'm currently doing) or write a few big files.



Small files are ~2 KB on average, but since I have a 4096-byte block size, I'm losing about half of my disk capacity.



However, since a lot of threads need to write at the same time, wouldn't big files be a bottleneck, since each thread would need to open the file, write, then close it?



To summarize, which is better for I/O and space optimization:




  • A lot of small files

  • A few big files
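
To put rough numbers on the space side of the question, here is a minimal sketch of the block-slack arithmetic. It assumes every file occupies a whole number of 4096-byte blocks and ignores inodes, directory entries, and any filesystem feature that stores small files inline; `allocated_bytes` is a hypothetical helper, not a system call.

```python
import math

def allocated_bytes(file_size: int, block_size: int = 4096) -> int:
    """Bytes reserved for data blocks: file size rounded up to whole blocks."""
    return math.ceil(file_size / block_size) * block_size

file_size = 2 * 1024       # ~2 KB average file, from the question
n_files = 1_000_000_000    # ~1e9 files, from the question

per_file = allocated_bytes(file_size)
wasted = per_file - file_size

print(f"allocated per file: {per_file} B, slack: {wasted} B ({wasted / per_file:.0%})")
print(f"total slack over all files: {wasted * n_files / 1e12:.2f} TB (data blocks only)")
```

With a few big files the same data would leave at most one partly filled block per file, which is where the space saving comes from.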






























  • Why not use a database instead? This would hand your problem over to a system designed to handle writes, caching, etc.

    – davidgo
    Jan 27 at 18:57











  • All threads in a process use the same file descriptor (which belongs to the process, not to individual threads).

    – xenoid
    Jan 27 at 19:48











  • @davidgo Which databases are suitable for that?

    – o2640110
    Jan 27 at 20:50











  • That really depends on your data, especially on how the data is going to be retrieved and worked with. For a standard SQL-type database, look into MySQL or PostgreSQL, but depending on the nature of your data, the risks related to lost data, etc., a NoSQL database (which I'm not that familiar with, but they come in lots of flavours) may be better - en.m.wikipedia.org/wiki/NoSQL

    – davidgo
    Jan 27 at 21:06











  • Databases do not use file descriptors (in the interface with programs) - typical SQL databases treat each row separately unless you add locks or constraints.

    – davidgo
    Jan 27 at 21:08
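
As an illustration of the shared-descriptor point raised in the comments, here is a minimal sketch in which several threads append fixed-size records to one big file through a single descriptor, so no thread reopens or closes the file per write. The record size comes from the question; the thread count, record count, and file name are made-up values, and the sketch assumes a local Linux filesystem where each `os.write` on an `O_APPEND` descriptor lands at the current end of file.

```python
import os
import tempfile
import threading

RECORD_SIZE = 2048          # ~2 KB records, matching the question
RECORDS_PER_THREAD = 50     # illustrative value
N_THREADS = 8               # illustrative value

def writer(fd: int, tag: int) -> None:
    # One os.write() call per record, all on the same shared descriptor.
    record = bytes([tag]) * RECORD_SIZE
    for _ in range(RECORDS_PER_THREAD):
        written = os.write(fd, record)
        if written != RECORD_SIZE:
            raise IOError("short write; a real writer would have to retry")

path = os.path.join(tempfile.mkdtemp(), "big.dat")   # hypothetical file name
# Opened once for the whole process; O_APPEND makes every write go to the end.
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
try:
    threads = [threading.Thread(target=writer, args=(fd, i)) for i in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
finally:
    os.close(fd)

total_bytes = os.path.getsize(path)
print(f"{total_bytes} bytes written by {N_THREADS} threads")
```

Whether this is fast enough depends on the filesystem and the storage device, so it is worth measuring against the small-files approach rather than assuming either one wins.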























linux hard-drive














edited Jan 28 at 0:38







o2640110

















asked Jan 27 at 17:27









o2640110


















1 Answer
































The easiest approach might be to let write caching determine how frequently actual HDD (or SSD) writes are made. You can turn write caching on or off at the OS level, or experiment with various hdparm settings. This enables tuning without altering your application. See Unix StackExchange on tuning.
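
Before changing anything, it can help to see what the system is currently doing. The following is a small read-only sketch, assuming that the OS-level tuning meant here is Linux's page-cache writeback settings under `/proc/sys/vm` (the question is tagged linux); `read_writeback_settings` is a hypothetical helper, and a setting that cannot be read is reported as `None`.

```python
from pathlib import Path

# Linux writeback knobs; how to tune them is outside the scope of this sketch.
KNOBS = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)

def read_writeback_settings(root: str = "/proc/sys/vm") -> dict:
    """Return the current value of each knob, or None if it is unavailable."""
    settings = {}
    for name in KNOBS:
        try:
            settings[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings

for name, value in read_writeback_settings().items():
    print(f"{name} = {value}")
```

The drive's own write cache is a separate setting, which is the part hdparm deals with.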



Another possibility is to write to a RAM-disk, and periodically move data to the HDD.
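
A rough sketch of that idea, under some assumptions: the staging directory is taken to live on a RAM-backed filesystem (for example a tmpfs mount), the destination is on the HDD, and each flush bundles the staged small files into one tar archive so that the disk sees a few big files. `flush_staging` and the file names are hypothetical, and the demo uses temporary directories in place of real mounts.

```python
import os
import tarfile
import tempfile
import time
from typing import Optional

def flush_staging(staging_dir: str, dest_dir: str) -> Optional[str]:
    """Pack all regular files in staging_dir into one tar archive in dest_dir.

    Returns the archive path, or None if there was nothing to flush.
    Assumes no writer is still in the middle of writing a staged file.
    """
    names = sorted(
        n for n in os.listdir(staging_dir)
        if os.path.isfile(os.path.join(staging_dir, n))
    )
    if not names:
        return None
    archive = os.path.join(dest_dir, f"batch-{time.time_ns()}.tar")
    with tarfile.open(archive, "w") as tar:
        for n in names:
            tar.add(os.path.join(staging_dir, n), arcname=n)
    # Remove staged files only after the archive has been written and closed.
    for n in names:
        os.remove(os.path.join(staging_dir, n))
    return archive

# Demo with temporary directories standing in for the RAM disk and the HDD.
staging = tempfile.mkdtemp()
dest = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(staging, f"rec-{i}.bin"), "wb") as f:
        f.write(b"x" * 2048)

archive = flush_staging(staging, dest)
print("flushed to", archive)
```

In a real setup the flush would run on a timer or once the staging area reaches a size threshold, and writers would typically create files under a temporary name and rename them when complete, so that a flush never picks up a half-written file.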



Caveat: Delaying writes increases the possibility of data loss if power fails before the data reaches the disk, though if you're using a laptop or a PC with a UPS, that might not be an issue.






























  • I didn't know about write caching, quite interesting

    – o2640110
    Jan 27 at 23:39











answered Jan 27 at 18:03









DrMoishe Pippik












