Wednesday, November 25, 2015

Creating MongoDB Shards and Replica Sets with PowerShell

In the MongoDB University course "M101N: MongoDB for .NET Developers", there is a walkthrough of a UNIX script that creates a 3x3 set of shards and replica sets. To brush up on my MongoDB and my PowerShell, I decided this weekend to try to rewrite and extend the script in PowerShell. It was a somewhat painful, but ultimately rewarding, experience. You can see what I came up with here. In this post I will walk through some of the PowerShell and some of the MongoDB syntax and tricks.

General Approach

I wanted to be able to run the script multiple times, so I decided to put my data directories under c:\temp and to delete them at the start of the run. Similarly, I spawn all the "mongod" processes as windows, so they are easy to kill with a right click on the task bar. (I didn't want to kill all mongod processes, because I didn't want to touch the one running as a service that supports my development Sitecore instances.) Also, I used simple values for my port numbers: the mongod processes run on ports 30000 to 30008, the configuration servers on 40000, 40001, and 40002, and the mongos (which functions as a router in a sharded environment) on port 50000. I have also added some diagnostics to check the status of various steps in the process, rather than simply waiting for an arbitrary 60 seconds, as the original script does.

Clean Up, Set Up

The script begins by establishing a temporary directory, cleaning up any old copy, and creating an output function, "report", to facilitate nicely formatted status reporting. The results of New-Module and New-Item are piped to Out-Null to keep the output stream clean.

$rootpath = "/temp/mongoshards/"

new-module  -scriptblock {function report($text) {
write-output $("-" * $text.length)
write-output $text
write-output $("-" * $text.length)
write-output ""
}}  | Out-Null

report "Remove temporary directory"

remove-item $rootpath -recurse -ErrorAction SilentlyContinue

report "Create data directories"

new-item -type directory -path $rootpath | Out-Null


Creating the Mongod processes

The logic to create the mongod processes is pretty straight-forward:

report "Create mongod instances"

$shards = 0..2
foreach ($shard in $shards) {
  $rss = 0..2
  foreach ($rs in $rss) {
    $dbpath = "$rootpath/data/shard${shard}/r${rs}"
    new-item -type directory -path $dbpath | Out-Null
    # Start mongod processes
    $port = 30000 + ($shard * 3) + $rs
    # Named $mongodArgs because $args is an automatic variable in PowerShell
    $mongodArgs = "--replSet s$shard --logpath $rootpath/s${shard}_r${rs}.log --dbpath $dbpath --port $port --oplogSize 64 --smallfiles"
    $process = start-process mongod.exe $mongodArgs
  }
}

The only trickiness here is the variable substitution, leading to paths like "data/shard0/r1", and the logic to create the port numbers: 30000-30002 for the shard 0 processes, 30003-30005 for s1, and 30006-30008 for s2. Of course, these are not yet replica sets; we handle that next.
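
As a quick check of the port arithmetic, here is a minimal sketch that prints the shard, replica, and port for each of the nine processes. The function name Get-ShardPort is made up for illustration and is not part of the script.

```powershell
# Hypothetical helper (not in the original script): maps a shard index and
# a replica index to the port used by that mongod process.
function Get-ShardPort([int]$shard, [int]$rs) {
    return 30000 + ($shard * 3) + $rs
}

# List all nine shard/replica combinations with their ports.
foreach ($shard in 0..2) {
    foreach ($rs in 0..2) {
        $port = Get-ShardPort $shard $rs
        Write-Output "s$shard r$rs -> $port"
    }
}
```

The first port of each shard (30000, 30003, 30006) is the one the later steps use to talk to that replica set.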

Creating the Replica Sets

This is done by creating a config document for each replica set and passing it to rs.initiate.

report "Configure replica sets"
foreach ($shard in $shards) {
  $port1 = 30000 + $shard * 3
  $port2 = 30000 + $shard * 3 + 1
  $port3 = 30000 + $shard * 3 + 2
  $configBlock = "{_id: ""s$shard"", members: [ {_id:0, host:""localhost:$port1""}, {_id:1, host:""localhost:$port2""}, {_id:2, host:""localhost:$port3""}]}"
  echo "rs.initiate($configBlock)" | mongo --port $port1
}

The echo "javascript" | mongo is a nice bit of syntax I picked up from the course, and simplifies passing MongoDB commands from a script.  Since it takes a little while for a server to win an election and become a PRIMARY, we set up a one second loop to look for this event:

report "Check PRIMARY elected for each replica set"
while ($True) {
  $response1 = (echo "rs.status()" | mongo --port 30000)
  $response2 = (echo "rs.status()" | mongo --port 30003)
  $response3 = (echo "rs.status()" | mongo --port 30006)

  if (($response1 -clike "*PRIMARY*") -and ($response2 -clike "*PRIMARY*") -and ($response3 -clike "*PRIMARY*")) {break}
  Start-Sleep -s 1
  Write-Output "."
}
report "PRIMARY elected"

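Note that capturing the output of a command creates an array of strings, one per line. Applied to such an array, the comparison operator -clike returns the members that match (case-sensitively), so the test succeeds if any line of the output contains PRIMARY.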
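
To see this array behavior in isolation, here is a small sketch that needs no MongoDB at all. The two sample lines are made up to stand in for captured mongo output.

```powershell
# Sample lines standing in for captured mongo output (made up for illustration).
$lines = @('"stateStr" : "PRIMARY",', '"stateStr" : "SECONDARY",')

# Against an array, -clike returns the matching elements, not a Boolean.
$hits = $lines -clike "*PRIMARY*"

# The match is case sensitive, so the lowercase pattern matches nothing.
$none = $lines -clike "*primary*"

Write-Output "Matches: $($hits.Count)"
Write-Output "Lowercase matches: $($none.Count)"

# A non-empty result counts as true in an if statement.
if ($lines -clike "*PRIMARY*") { Write-Output "PRIMARY found" }
```

The case-insensitive -like would match both patterns; -clike is the stricter choice here because the state names in rs.status() are upper case.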

Creating the Shards

Three steps are left to create the shards. First, we need to create the configuration servers that will store which records go where. Then we need to define each replica set as a shard. Finally, we need to specify the collection and key that will be used for sharding the data.

report "Create config servers"
$cfg_a = "${rootpath}/data/config_a"
$cfg_b = "${rootpath}/data/config_b"
$cfg_c = "${rootpath}/data/config_c"

new-item -type directory -path $cfg_a
new-item -type directory -path $cfg_b
new-item -type directory -path $cfg_c

$arg_a = "--dbpath $cfg_a --logpath ${rootpath}/cfg-a.log --configsvr --smallfiles --port 40000"
$arg_b = "--dbpath $cfg_b --logpath ${rootpath}/cfg-b.log --configsvr --smallfiles --port 40001"
$arg_c = "--dbpath $cfg_c --logpath ${rootpath}/cfg-c.log --configsvr --smallfiles --port 40002"

start-process mongod $arg_a
start-process mongod $arg_b
start-process mongod $arg_c

report "Config servers up"

The three configuration servers store the definitive version of what data resides where; the mongos instances keep a copy of this data in memory.

Once the configuration servers are set up, the next step is to launch mongos and add the shards. Note the step to make sure that the mongos on port 50000 is online. Basically, if the response does not contain a line with the word "failed", the server is treated as online.

report "Launch mongos"
$args_s = "--port 50000 --logpath ${rootpath}/mongos-1.log --configdb localhost:40000,localhost:40001,localhost:40002"
start-process mongos $args_s

report "Check mongos online on port 50000"
while ($True) {
  # Discard stderr; only stdout is inspected for the word "failed"
  $output = echo "" | mongo localhost:50000 2> $null
  if (-not ($output -like "*failed*")) {break}
  Start-Sleep -s 1
  Write-Output "."
}
report "Mongos available at port 50000"

report "Configure shards"

echo "db.adminCommand( { addshard: ""s0/localhost:30000"" })" | mongo  --quiet  --port 50000 

echo "db.adminCommand( { addshard: ""s1/localhost:30003"" })" | mongo  --quiet --port 50000 

echo "db.adminCommand( { addshard: ""s2/localhost:30006"" })" | mongo  --quiet --port 50000 

echo "db.adminCommand( { enableSharding:""school"" })" | mongo --port 50000 

echo "db.adminCommand( { shardCollection:""school.students"", key:{student_id:1} })" | mongo --port 50000 
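
Both polling loops above keep running if a process never comes up. One possible refinement, shown here only as a sketch, is to wrap the check in a helper that gives up after a timeout. The name Wait-Until and its parameters are hypothetical and do not appear in the original script.

```powershell
# Hypothetical helper (not in the original script): polls a condition once a
# second and gives up after a timeout instead of looping forever.
function Wait-Until([scriptblock]$condition, [int]$timeoutSeconds = 60) {
    $deadline = (Get-Date).AddSeconds($timeoutSeconds)
    while ((Get-Date) -lt $deadline) {
        # Any truthy result (including a non-empty array of matches) ends the wait.
        if (& $condition) { return $true }
        Start-Sleep -Seconds 1
    }
    return $false
}

# Example (requires a running mongod, so it is left commented out):
# $ok = Wait-Until { (echo "rs.status()" | mongo --port 30000) -clike "*PRIMARY*" } 30
# if (-not $ok) { report "No PRIMARY elected on port 30000 within 30 seconds" }
```

The caller decides what to do on a timeout, for example reporting the failure and stopping the script rather than continuing with a half-built cluster.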

Loading some data

To get some data, I use a short JavaScript script (from MongoDB University) that inserts a list of students and course grades. Once this is done, I display the counts for the combined sharded collection and for each of the specific shards, followed by the output of sh.status(), which shows the breakpoints that MongoDB is using to distribute data.

report "Generate 100,000 documents"

$mongoUniversityScript = "db=db.getSiblingDB(`"school`");
types = ['exam', 'quiz', 'homework', 'homework'];
// 10,000 students
for (i = 0; i < 10000; i++) {

    // take 10 classes
    for (class_counter = 0; class_counter < 10; class_counter ++) {
        scores = []
        // and each class has 4 grades
        for (j = 0; j < 4; j++) {
            scores.push({'type':types[j],'score':Math.random()*100});
        }

        // there are 500 different classes that they can take
        class_id = Math.floor(Math.random()*501); // get a class id between 0 and 500

        record = {'student_id':i, 'scores':scores, 'class_id':class_id};
        db.students.insert(record);
    }
}"

echo $mongoUniversityScript | mongo --port 50000 --quiet

report "Total records, records in shard 1, 2, and 3"

echo "db.students.count()" | mongo school --port 50000

echo "db.students.count()" | mongo school --port 30000

echo "db.students.count()" | mongo school --port 30003

echo "db.students.count()" | mongo school --port 30006

report "sh.status() output" 

echo "sh.status()" | mongo --port 50000

Again, I have a link to the full script at the top of the page.  I'm a rank beginner at PowerShell, so please feel free to make suggestions about how style and substance could be improved.  
