Making Scripting Safer

by Sashank Thupukari on August 20th, 2019

When building products at a fast-growing company, there is rarely the time to build rich internal websites and tooling for every business need. Often, we quickly scaffold scripts and command line utilities to get something done.

Scale is the API for high quality training data, providing the infrastructure for our customers to build AI applications. We deal with an extremely large amount of data, complex data pipelines, and customer-specific feature requests.

One way we’ve managed to move fast and experiment is with scripts. In this post, we explore how our engineering teams use scripts and how we’ve enabled engineers to use scripting effectively and safely.

Why is scripting useful?

A script is usually a single file containing code that is run manually by an engineer. This is in contrast to code that is run via an interface, a cron job, or any other traditional trigger. This might seem like a simple concept, but enabling teams to leverage scripts effectively is crucial to helping them move quickly.

At Scale, we aim to be data-driven. We want to accurately understand details about how our customers are using our product, the state of complex data pipelines, and trends of important metrics. In order to enable a fast-moving and data-focused culture, we need to provide teams with an interface to quickly and easily write code to answer their various questions and take immediate actions. We found that a scripting framework was a valuable tool for this use case and prototyping. Once scripts are used frequently, we move their functionality into internal tooling.

A few common use cases we’ve seen for scripts:

→ Build initial implementation for complex analytics

In many situations, we can simply write a raw database query to find the data we’re looking for. We use MongoDB as our primary data store and have built a tailer to replicate our database into AWS Athena, making it easy to execute fast ad-hoc Presto queries on our data. We’ve also adopted Mode Analytics as an SQL interface and visualization tool, and built tooling to make it easy to run scheduled queries.

Sometimes, standalone queries are insufficient to answer complex data questions, and we need more than the raw data or simple database filters and aggregations over it. We need to hook into our codebase to use the models and helpers we’ve built to calculate and report deeper insights, and to pull from data sources other than our database.

For example, scripts have been useful for prototyping analytics over S3 objects. We use S3 to store LiDAR point cloud data, and calculations such as how many LiDAR points are within a specific cuboid annotation and the number of cuboid paths in a subsection of a LiDAR scene are already implemented in our codebase. Another example is reconciling data from the Stripe API with our internal records to determine how much a customer has already paid.

Scripts let us run multiple queries against different sources, easily manipulate and massage the data, and calculate important statistics and metrics for analytics purposes, all with the same code we use throughout the rest of the codebase.
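
As a deliberately simplified sketch of what this looks like in practice, a single script might reconcile our own records with an external billing source. The Task model and fetchStripeCharges helper below are hypothetical stand-ins, not our actual code:

import scriptMain from '../scriptMain.js';
import Task from '../models/Task';                   // hypothetical model
import { fetchStripeCharges } from '../lib/billing'; // hypothetical helper around the Stripe API

scriptMain.run(async () => {
  const customerId = process.argv[2];

  // Query our own database with the same models the rest of the codebase uses
  const tasks = await Task.find({ customer: customerId, status: 'completed' });
  const amountOwed = tasks.reduce((total, task) => total + task.price, 0);

  // Pull from a second source and reconcile
  const charges = await fetchStripeCharges(customerId);
  const amountPaid = charges.reduce((total, charge) => total + charge.amount, 0);

  console.log(`Customer ${customerId}: owed ${amountOwed}, paid ${amountPaid}`);
});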

If we eventually want a script’s resulting analytics to be monitored, we pull the code from the script into our codebase, set it to run based on an appropriate trigger (such as hitting an endpoint or a cron), and cache the output in MongoDB.
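
When we do promote a script like that, the productionized version might look roughly like the following sketch (node-cron as the scheduler and the AnalyticsCache collection are illustrative assumptions):

import cron from 'node-cron';                               // assumed scheduler
import AnalyticsCache from '../models/AnalyticsCache';      // hypothetical cache collection
import { computeCustomerMetrics } from '../lib/analytics';  // code pulled out of the original script

// Recompute nightly and cache the output in MongoDB so dashboards can read it
cron.schedule('0 3 * * *', async () => {
  const metrics = await computeCustomerMetrics();
  await AnalyticsCache.create({ metrics, computedAt: new Date() });
});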

→ Perform one-off operations on data

One-off database operations, such as updating fields based on a filter or creating a large number of new objects, are often best served by scripts, especially when working with sensitive data where you need the ability to revert the operation easily. Scripts can log output in a form that makes it easy to track modifications to data and roll back the changes.

For example, a common pattern that shows up is:

// Object B has a reference to an object of type A, 
// stored under field a_id
// We want to change the reference from A_1 to A_2

1. b = Get object B
2. a_1 = b.a_id             // Originally ref to A_1
3. a_2 = Create object A_2
4. b.a_id = a_2._id
5. Log `B ${b._id} changed a_id from ${a_1._id} to ${a_2._id}`

If we were changing a large number of these references, our script would log all of the changes. If we needed to revert the operation, we could easily parse the output and switch the references back to their original values.
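
A minimal sketch of such a reference-swapping script, using the framework described later in this post and hypothetical ObjectA/ObjectB models, might look like:

import scriptMain from '../scriptMain.js';
import ObjectA from '../models/ObjectA'; // hypothetical models
import ObjectB from '../models/ObjectB';

scriptMain.run(async () => {
  const bs = await ObjectB.find({ /* filter for the objects to change */ });

  for (const b of bs) {
    const oldAId = b.a_id;                             // Originally a ref to A_1
    const newA = await ObjectA.create({ /* ... */ });  // Create A_2
    b.a_id = newA._id;
    await b.save();

    // One line per change; easy to parse later if we ever need to revert
    console.log(`B ${b._id} changed a_id from ${oldAId} to ${newA._id}`);
  }
});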

→ Quickly automate workflows

Many of our workflows and products start off as a script that is run manually. Through experimentation and refinement they are eventually turned into fully-fledged internal products. This helps avoid premature optimization and overengineering, and encourages iteratively solving problems.

For example, we initially ran customer billing through scripts. We had our billing workflow defined in a few scripts, and an engineer would run and review the results manually on a weekly basis, while making sure everything was working as expected. As we gained confidence in the system and worked out any issues, we built it into a standalone product and now have it running automatically as a production system.

→ Perform Backfills

Every engineering team eventually has to perform backfills on data. When building new features, we compute new values and metrics for those features going forward, but sometimes need those values for historical data as well. In other cases, we find bugs that corrupt data and have to perform a backfill to recompute and fix the database.
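
A typical backfill script might look roughly like the sketch below; the Task model and computeMetric helper are illustrative, not our actual code:

import scriptMain from '../scriptMain.js';
import Task from '../models/Task';              // hypothetical model
import { computeMetric } from '../lib/metrics'; // hypothetical helper

scriptMain.run(async () => {
  // Only touch documents that are missing the new field
  const cursor = Task.find({ newMetric: { $exists: false } }).cursor();

  let count = 0;
  for (let task = await cursor.next(); task != null; task = await cursor.next()) {
    task.newMetric = computeMetric(task);
    await task.save();
    console.log(`Backfilled newMetric for task ${task._id}`);
    count += 1;
  }

  console.log(`Backfilled ${count} tasks`);
});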

The need for a scripting framework

As we grew our engineering team, we came across several issues with running arbitrary scripts directly against production data. When we write software in our codebase, we follow these engineering best practices:

  • Using version control for code traceability and visibility into changes
  • Thorough testing (e.g. unit, integration, regression) to ensure correctness
  • Logging for error reporting, monitoring, and debugging issues

When it came to scripts, these core practices usually went out the window. We recognized the need for and importance of scripts, but as we scaled we grew increasingly wary of the risks they posed to data integrity and system health. To enable engineers to use scripting while maintaining the confidence we have in normal code, we set out to build a robust scripting framework.

We had the following requirements for our framework:

  • Maintain low friction to easily run scripts
  • Allow scripts to be run against the correct environments (dev, staging, production)
  • Allow dry runs of scripts (where updates to DB are not persisted) to ensure correctness
  • Provide global visibility into the history of all scripts run across the organization, so that we know what operations are being run and who they are being run by
  • Provide a logging interface to trace exactly what data scripts modified to assist in debugging issues

The last two requirements gave us the biggest wins. When we faced our biggest production DB issue to date, being able to refer to the offending script’s logs and saved source code made a huge difference in fixing the issue correctly and efficiently. In the next section, we’ll show how we implemented script execution, logging, and storage in the framework to create this lightweight but extremely useful abstraction.

Design

Here, we walk through the complete design and implementation of our framework. Our primary stack is Node.js and MongoDB, but these concepts can be easily applied to any technologies.

Executable Function Wrapper

First, we define a wrapper run function, which will help us encapsulate executable functions with proper error logging and exit codes. We can think of this as our “main” function, or an entry point to start execution of a script.

// scriptMain.js 

const run = async (f) => {
  try {
    console.log('Starting script...');
    await f();
    process.exit(0);
  } catch (err) {
    console.error(err.stack);
    process.exit(1);
  }
};

export default { run };

Now, in our script, we can import run and define our script actions:

import scriptMain from '../scriptMain.js';

scriptMain.run(async () => {
    // Script actions go here
});

Free benefit of testing: Since our script is its own file with the main method wrapped, we can write helper functions in our scripts that we can export and write unit tests for.
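
For instance, a script can export a pure helper alongside its entry point and test it in isolation. A minimal sketch, with a hypothetical helper and a Jest-style test:

// recomputeTotals.js (hypothetical script)
import scriptMain from '../scriptMain.js';

// Pure helper that can be unit tested without touching the database
export const sumAmounts = (lineItems) =>
  lineItems.reduce((total, item) => total + item.amount, 0);

scriptMain.run(async () => {
  // ...load line items from the DB and log sumAmounts(lineItems) here
});

// recomputeTotals.test.js
import { sumAmounts } from './recomputeTotals';

test('sums line item amounts', () => {
  expect(sumAmounts([{ amount: 2 }, { amount: 3 }])).toBe(5);
});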

Script Executor

Now that we have our script wrapper defined, we can create a script runner, which will:

  • Take as input which environment (production, staging, or development) the script is run on
  • Save a commit message and metadata for the script execution
  • Run the script
  • Save the output logs of the script
  • Enable a read-only mode (which restricts writes to the DB)

Determining environment

In package.json, we define the following script-running commands. The -ro suffix stands for “read-only”.

"scripts": {
  "script-prod": "env $(cat .env.prod .env.global | xargs) SCRIPT=1 node /utils/run-script.js prod",
  "script-staging": "env $(cat .env.prod .env.global | xargs) SCRIPT=1 node /utils/run-script.js staging",
  "script-local": "env $(cat .env.local .env.global | xargs) SCRIPT=1 node /utils/run-script.js staging",
  
  "script-prod-ro": "env $(cat .env.prod .env.global | xargs) MONGO_RO=1 SCRIPT=1 node /utils/run-script.js prod",
  "script-staging-ro": "env $(cat .env.prod .env.global | xargs) MONGO_RO=1 SCRIPT=1 node /utils/run-script.js staging",
  "script-local-ro": "env $(cat .env.local .env.global | xargs) MONGO_RO=1 SCRIPT=1 node /utils/run-script.js staging",
}

To break it down:

# Environment variables:
# Set the appropriate environment variables for running the script. 
# We have a .env.global base file, plus environment-specific files
# (.env.local, .env.staging, .env.prod).
# Piping through xargs strips the newlines so env receives a single list of KEY=VALUE pairs
env $(cat .env.prod .env.global | xargs)

# Script Environment Variable:
# This allows code to detect whether it is running within a script.
# That way we can block certain functions or data access from scripts
# if there's ever a need to.
SCRIPT=1

# Readonly Environment Variable
# If -ro (readonly) is appended to the script command, then we also set an additional
# environment variable to signal that we want to connect to a readonly instance
# of the database
MONGO_RO=1

# Run Script:
# We finally run run-script.js with the environment as its first argument
node /utils/run-script.js prod
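
To make the read-only flag concrete, here is a minimal sketch of how MONGO_RO could be honored when opening the database connection; the URI environment variable names are assumptions rather than our exact setup:

// db.js (sketch)
const mongoose = require('mongoose');

// If MONGO_RO is set, connect to a read-only replica instead of the primary
const uri = process.env.MONGO_RO === '1'
  ? process.env.MONGO_READONLY_URI
  : process.env.MONGO_URI;

module.exports = mongoose.connect(uri);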

We can run our script by using the command:

$ yarn run script-prod /scripts/myExampleScript.js

Defining run-script.js

Once we have our script commands set up, we can write our script runner. We first grab the environment and source path for the script from the arguments:

// run-script.js 

const env = process.argv[2];
const sourcePath = process.argv[3];

Saving Commit Messages and Metadata

To run a script against production, we require a commit message and save metadata about the script run in our database.

By using commit messages, we introduce accountability and traceability. We can look back at any given script run and determine who ran it and why (e.g. “Delete all tasks from Customer X”). We also keep track of the commit hash that was checked out and the original source code of the script, so we know exactly how to reproduce the script run if we need to.

To gather the information:

// run-script.js continued

// Request commit message
const rlp = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
    terminal: true,
});
const reason = await rlp.questionAsync('Enter a commit message (leave blank for read-only): ');
rlp.close();

// Who is the git user running the script
const runBy = (await exec('git config user.email')).trim();

// What hash does the user have checked out
const gitHash = (await exec('git rev-parse HEAD')).trim();

// The source code of the script (saved to s3)
const rawSourceCode = await fs.readFileAsync(sourcePath, 'utf8');
// Uploads to S3 and returns link to S3 resource
const sourceCode = await uploadTextToS3(rawSourceCode); 

And save the metadata to our DB:

const script = await Script.create({
  runBy,
  gitHash,
  reason,
  params: {
    argv: process.argv.slice(3), // The arguments passed to the script
    sourceCode,
    filePath: sourcePath,
    env,
  },
});

await script.save();

Run the Script

If we are not given a commit message, we run the script in read-only mode. We do this by re-running the same command with -ro appended, which means the script is run against a read-only replica of our database and cannot update or change any data. This adds safety when experimenting and allows for dry runs.

// run-script.js continued

if(reason.trim().length === 0) {
    console.log("Running readonly script...");
    const child = spawn('yarn', ['run', `script-${env}-ro`, 
      ...process.argv.slice(3)], { stdio: 'inherit' });
    child.on('close', async (statusCode) => {
        process.exit(statusCode);
    });

    return;
}

If we are given a commit message, we allow running the script with write privileges, and also save the output of the logs to store for reference.

We pipe the output through tee so that we can save it to a file as well as view it on the screen (stdout).

const startTime = Date.now();
const logfile = `/tmp/${startTime}.log`;

const nodeArgs = [
    sourcePath,
    ...process.argv.slice(4), // Arguments for the script
];

const spawnArgs = [
    'node',
    [...nodeArgs, '|', 'tee', logfile],
    { stdio: 'inherit', shell: true }, // shell: true so the pipe to tee is interpreted
];

const child = spawn(...spawnArgs);

Saving output to S3

We also want to save all log output in case we need to debug issues in the future. We save the output to S3 and store the link on our previously created script object. We can use the following S3 helper function:

const AWS = require('aws-sdk');
const uuid = require('uuid');

const uploadTextToS3 = async (fileData) => {
    const s3Client = new AWS.S3({
        params: { Bucket: 'scaleapi-script-logs' },
    });

    const req = await s3Client.upload({
        Key: `${uuid.v4()}`,
        ACL: 'bucket-owner-read',
        ContentType: 'application/json',
        CacheControl: 'max-age=86400',
        Body: fileData.toString(),
    }).promise();

    return req.Location;
};

And finally, we upload the log file contents, delete the local file using unlink, and update our script object with a link to the output and the status code.

const saveScript = async (statusCode) => {
    const s3Link = await fs.readFileAsync(logfile).then(uploadTextToS3);
    fs.unlinkSync(logfile);
    
    script.result = {
        s3Link,
        statusCode,
    };
    script.finishedAt = new Date();
    await script.save();
}

Signal Handling

Finally, to monitor the child process and handle terminal input, we define the following handlers:

process.on('SIGINT', async () => {
    console.info('Got SIGINT. Saving script...');
    child.kill('SIGINT');
});

child.on('close', async (statusCode) => {
    await saveScript(statusCode);
    process.exit(statusCode);
});

process.on('SIGTERM', () => {
    console.info('Got SIGTERM. Waiting for jobs to complete...');
    process.exit(1);
});

Putting it together:

In the end, the framework boils down to two main files: scriptMain.js and run-script.js.

// scriptMain.js 

const run = async (f) => {
  try {
    console.log('Starting script...');
    await f();
    process.exit(0);
  } catch (err) {
    console.error(err.stack);
    process.exit(1);
  }
};

export default { run };

// run-script.js

const AWS = require('aws-sdk');
const Promise = require('bluebird'); // provides promisifyAll / promisify
const { spawn, exec: _exec } = require('child_process');
const process = require('process');
const readline = require('readline-promise').default;
const fs = Promise.promisifyAll(require('fs'));
const exec = Promise.promisify(_exec);
const uuid = require('uuid');
const Script = require('../models/Script');

const uploadTextToS3 = async (fileData) => {
    const s3Client = new AWS.S3({
        params: { Bucket: 'scaleapi-script-logs'},
    });

    const req = await s3Client.upload({
        Key: `${uuid.v4()}`,
        ACL: 'bucket-owner-read',
        ContentType: 'application/json',
        CacheControl: 'max-age=86400',
        Body: fileData,
    }).promise();

    return req.Location;
}

async function go() {
    const env = process.argv[2];
    const sourcePath = process.argv[3];

    // Request commit message
    const rlp = readline.createInterface({
        input: process.stdin,
        output: process.stdout,
        terminal: true,
    });
    const reason = await rlp.questionAsync('Enter a commit message (leave blank for read-only): ');
    rlp.close();

    // If no commit message, then run in read-only mode 
    if(reason.trim().length === 0) {
        console.log("Running readonly script...");
        const child = spawn('yarn', ['run', `script-${env}-ro`, 
          ...process.argv.slice(3)], { stdio: 'inherit' });
        child.on('close', async (statusCode) => {
            process.exit(statusCode);
        });
    
        return;
    }
    
    // Who is the git user running the script
    const runBy = (await exec('git config user.email')).trim();
    
    // What hash does the user have checked out
    const gitHash = (await exec('git rev-parse HEAD')).trim();
    
    // The source code of the script (saved to s3)
    const rawSourceCode = await fs.readFileAsync(sourcePath, 'utf8');
    const sourceCode = await uploadTextToS3(rawSourceCode);

    // Save script run 
    const script = await Script.create({
      runBy,
      gitHash,
      reason,
      params: {
        argv: process.argv.slice(3), 
        sourceCode,
        filePath: sourcePath,
        env,
      },
    });
    
    await script.save();
    
    // Spawn script
    const startTime = Date.now();
    const logfile = `/tmp/${startTime}.log`
    
    const nodeArgs = [
        sourcePath,
        ...process.argv.slice(4), // Arguments for the script
    ];
    
    const spawnArgs = [
        'node',
        [...nodeArgs, '|', 'tee', logfile],
        { stdio: 'inherit', shell: true }, // shell: true so the pipe to tee is interpreted
    ];

    const child = spawn(...spawnArgs);
    
    // Save to S3 and update Script Object
    const saveScript = async (statusCode) => {
        const s3Link = await fs.readFileAsync(logfile)
          .then(uploadTextToS3);
        fs.unlinkSync(logfile);
        
        script.result = {
            s3Link,
            statusCode,
        };
        script.finishedAt = new Date();
        await script.save();
    }
    
    // Signal and Child process handling
    process.on('SIGINT', async () => {
        console.info('Got SIGINT. Saving script...');
        child.kill('SIGINT');
    });
    
    child.on('close', async (statusCode) => {
        await saveScript(statusCode);
        process.exit(statusCode);
    });
    
    process.on('SIGTERM', () => {
        console.info('Got SIGTERM. Waiting for jobs to complete...');
        process.exit(1);
    });
}

// Run script
go().catch(err => console.error(err.stack));

Helpful tricks and tips

Allow debugging and support for compiled scripts

We can allow an optional DEBUG environment variable that sets the --inspect-brk flag when running node. This lets us use the Chrome debugger to help fix issues with our script.

If we have a build system that outputs compiled files, we can also detect whether it's running in watch mode in the background and use the appropriate version of the script (compiled or source). We also have to check that none of the compiled files are outdated, in case the build watch script is lagging behind.

const debug = process.env.DEBUG === 'true';

// Our compiled files live at the same path, but under /dist-watch and with
// a .js extension instead of .ts (TypeScript)
const sourcePath = `${process.argv[3]}`;
const compiledPath = `dist-watch/${sourcePath}`.replace(/\.ts$/, '.js');

// Detect whether the build system is running in the background
// In our case, this is the build-watch command
const ps = await exec('ps aux');
let shouldRunCompiled = ps.indexOf('build-watch') !== -1; // true if build-watch is running

// Ensure none of the compiled files are out of date. If any are, we might be
// running old code, so we should instead run directly from source.
// findOutOfDate returns all source files whose modified timestamp is later
// than the modified timestamp of the corresponding compiled file.
const outOfDateFiles = await findOutOfDate('/dist-watch');
if (outOfDateFiles.length > 0) {
    shouldRunCompiled = false;
}

if (shouldRunCompiled) {
    console.log("Running pre-compiled script");
} else {
    console.log("Running slowly. Run build-watch to pre-compile dependencies");
}

const nodeArgs = [
    ...(debug ? ['--inspect-brk=9229'] : []), // New: debug flag
    shouldRunCompiled ? compiledPath : sourcePath, // New: compiled flag
    ...process.argv.slice(4),
];
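
The findOutOfDate helper isn't shown here; a rough sketch of how it could be implemented, assuming the compiled tree under dist-watch mirrors the source tree and using the promisified fs from run-script.js, is:

const path = require('path');

// Walk the compiled output directory and return every source file that has
// been modified more recently than its corresponding compiled file.
const findOutOfDate = async (distDir) => {
    const outOfDate = [];

    const walk = async (dir) => {
        const entries = await fs.readdirAsync(dir);
        for (const entry of entries) {
            const compiledFile = path.join(dir, entry);
            const stat = await fs.statAsync(compiledFile);

            if (stat.isDirectory()) {
                await walk(compiledFile);
                continue;
            }

            // Map dist-watch/foo/bar.js back to foo/bar.ts
            const sourceFile = path
                .relative(distDir, compiledFile)
                .replace(/\.js$/, '.ts');

            const sourceStat = await fs.statAsync(sourceFile).catch(() => null);
            if (sourceStat && sourceStat.mtimeMs > stat.mtimeMs) {
                outOfDate.push(sourceFile);
            }
        }
    };

    await walk(distDir);
    return outOfDate;
};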
    

File structure

To organize the hundreds of scripts in our codebase, we have a file structure like the following:

/scripts
|-- /users            # Personal workspaces for scripts
    |-- /sthupukari
    |-- /akshat
    |-- /stevenhao
|-- /payouts          # Public, tested scripts for use across the org
|-- /lidar

By giving each engineer their own folder, we give them a space to build a collection of personal scripts that are useful to their workflow. Eventually, when a script becomes robust enough and helpful to others, it can be moved to a public folder, signaling that it’s ready to be used.