Here is the premise of this blog entry: for a project I am working on, I needed to discover the size of a metadata database with full index data in it. I am trying to come up with worst-case numbers so I don't hit a wall unexpectedly when someone pushes the boundaries of the system.
The first test I did was very simple: generate a single 5GB file containing random ASCII. I went this route because I wanted to ensure that the index engine I am using couldn't reduce the data set. It does vector-based indexing and common-word elimination, so the completely random data in the file can't be reduced by the vector index in any way that would shrink the database. This process is great for saving space and does an amazing job in the real world, but I needed to defeat it for my test in case someone introduced data to my system that would push its limits.
My simple approach to creating this test file was to use the command below. By the way, I am using Ubuntu 9.10, so before this command would work I needed to install binutils with "sudo apt-get install binutils".
cat /dev/urandom | strings > file.txt
I then opened a second terminal window to monitor the file size manually until it reached the size I needed. Simple enough...
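If you would rather not keep rerunning ls by hand, a minimal sketch for that second terminal is a simple loop like the one below; it assumes file.txt is the output file from the command above and just reprints its size every second until you interrupt it.

# Reprint the human-readable size of file.txt once a second;
# Ctrl-C when it reaches the target size.
while true; do ls -lh file.txt; sleep 1; done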
I ran the data into the system, looked at my results after the test, and thought to myself, "This can't be right." Then I realized where my test had failed. Quite simply, every file ingested also generates a row in the metadata database, and I had generated only a single row for a 5GB file. That is neither very real-world nor worst case.
I then started working on the problem from a more controlled angle, since I realized I was going to have to generate a good deal more data in smaller batches. The first step was to figure out how to generate an ASCII document of a specific size, since my earlier approach gave absolutely no control over the file size other than manual intervention. Anyone who has worked with more than a couple of files understands that manual intervention would be impossible for the number of files I was going to need.
So the next thing I came up with was to use dd to create a file of the proper size: direct the output of /dev/urandom into a tmp file, then use that tmp file as dd's input.
My script started to look like the snippet below. I needed to make the script sleep for a period of time because I was having trouble getting enough data into the tmp file before dd tried to collect it.
cat /dev/urandom | strings > tmp &
sleep .25
dd if=tmp of=testdata/n.txt bs=32k count=1
kill `ps -ef | grep /dev/urandom | nawk '{print $2}'`
rm tmp
This gave me a single 32k file full of random ASCII inside a directory labeled testdata, which I had previously created to keep this mess organized while testing the script. You will notice the .txt extension on the file name; that was for my specific testing purposes, since I identify document types by extension when deciding how to handle them, which is a completely different subject.
After getting this all working, it was a simple matter of putting some controls in place to generate the number of files I needed at the size I needed, and making the output of the script easier to deal with. I made it so both of those variables could be passed in on the command line. I ultimately ended up with a script that looked like this.
#!/bin/sh
#The purpose of this script is to generate a set of files that are of a specific file size
#filled with random ascii data so as to create a dataset for testing against
#deduplication and tsv indexing algorithms, specifically creating a worst case
#scenario as it pertains to the size of meta data databases.
#To use this script from the command line you can input the variables for
#the number of files you want generated and the size of those files in kb.
#Your command should look like "mkfiles.sh filecount filesize" or
#"mkfiles.sh 163840 32". This command would generate 163,840
#32k files, or 5GB of data.
#If the files being created are smaller than the intended file size you may have
#to increase the sleep timer because it is not generating enough data into the tmp
#file to properly fill the tmp file before dd attempts to extract data from it.
i=0
fs=$2
bs=$fs'k'
mkdir testdata
until [ $i -eq $1 ]
do
cat /dev/urandom | strings > tmp &
sleep .25
echo "Making file " '#'$i "of size "$bs
dd if=tmp of=testdata/$i.txt bs=$bs count=1 1>/dev/null 2>/dev/null
kill `ps -ef | grep /dev/urandom | nawk '{print $2}'` 2> /dev/null
rm tmp
i=$(($i + 1))
done
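Before kicking off a full 5GB run, a quick sanity check along these lines is worth doing (this assumes the script is saved as mkfiles.sh and marked executable); each file in testdata should come out at 32k and the directory total at roughly 320k:

# Generate ten 32k files into ./testdata, then confirm their sizes
./mkfiles.sh 10 32
ls -lh testdata
du -sh testdata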
I do owe a friend of mine some kudos for helping with the cleanup here, as there was a bunch of noise being printed all over the screen that made it look like everything was blowing up. Plus, he helped me make the script more general purpose instead of narrowly focused on my problem.
I'm wondering why you had to create an intermediate file and not just pipe the output of strings directly into dd like this:
cat /dev/urandom | strings | dd of=testdata/n.txt bs=32k count=1
It looks a lot simpler this way. Is there some reason why it wouldn't work for you?
I initially attempted to do it that way, but I ran into an issue where the files I was generating were always smaller than what I was attempting to create.
If you cat the output of /dev/urandom you will see it actually generates data in blocks. It's not a steady stream of random data available to the system.
It was confusing to see that dd wasn't creating files of the proper size. As close as I could tell, dd was being starved for data and would end because its source didn't have any more data to put into the file. It essentially copied all the data it could and finished its task.
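If you want to reproduce what I was seeing, a rough illustration is the one-liner below, using the direct pipe you suggested and a throwaway output file; the exact size varies from run to run, but it usually comes up short of 32k.

# A read on a pipe can return less than a full block, and dd still
# counts that partial read against count=1, so the file ends up short.
cat /dev/urandom | strings | dd of=short.txt bs=32k count=1
ls -l short.txt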
You will notice there is a .25 sleep in the script. It is there to give /dev/urandom enough time to fill the tmp file before dd tries to use it, so that when dd reads the file there is enough data to produce output of the proper size.
I hope that answers your question as to why that didn't work for me. It was a rather interesting problem and part of the reason I wrote a blog post on the topic.
- Dean