Generate Lots of Test Data with CROSS JOIN
October 12, 2009 1 Comment
“Arrgh! There has got to be a better way to create some test data!”
Ever had that conversation with yourself? You need to create lots of data to load test an application or to create a demo. But going about it one row at a time is not only slow and tedious, but its results are less than stellar. …Joe Test101.
So, how can you create lots and lots of data without lots and lots of effort? Simple, let SQL do it for you.
Enter the Forgotten Join Type, the CROSS JOIN
Fortunately, SQL Server has a oft forgotten JOIN type that helps for occasions such as this. We’re all pretty familiar with INNER JOIN, LEFT JOIN, and RIGHT JOIN. But do you recall the most famous join type of them all? The CROSS JOIN?
A CROSS JOIN takes each row from the first table and combines it with every row from the second table. This produces a Cartesian product of the two tables.
For example, if my first table has 3 rows of data – Mary, Mark, Luke – and my second table has 2 rows of data – Jones, Smith – the Cartesian product will result in 6 rows of data. Mary Jones, Mary Smith, Mark Jones, Mark Smith, Luke Jones, and Luke Smith
The CROSS JOIN is much maligned. Most everyone who has ever written a SQL statement has accidentally produced a Cartesian product at some point. A query that you expected to return 12 hundred rows produced 12 billion rows. But we can harness that power for good and use it to create lots and lots of test data.
What’s in a Name?
The United States Census Bureau, that’s who! Their site has a list of the most popular male, female, and family names. And they make the list available in files for download.
When I last checked, there were a total of 5,494 Given Names and 88,799 Family Names in the three files (the Given Names were broken into two files, male and female).
Creating a Cartesian product of that many names would produce a whopping 487,861,706 unique names (88,799 * 5,494). That’s far more than I typically require for my testing. So, I’ve pruned the list down quite a bit in my environment, down to 2,500.
Using the CROSS JOIN
After importing the files, I used the following script to create my test data.
–clean up the imported names
Given_Name = RTRIM(Given_Name)
Family_Name = RTRIM(Family_Name)
–create a very simple customers table
CREATE TABLE Customers
Customer_ID INT NOT NULL IDENTITY(1,1)
,Last_Name VARCHAR(20) NOT NULL
,First_Name VARCHAR(20) NOT NULL
,Email_Address VARCHAR(50) NULL
–create the test data
,f.Given_Name + ‘.’ + l.Family_Name + ‘@hotmail.com’
First_Names AS f CROSS JOIN
Last_Names AS l
My new Customers table now has 13,735,000 rows in it. A real business should be so fortunate!