Who are you, and what do you do?
I am Joseph Turian, the founder of MetaOptimize. We consult on machine learning, natural language processing, and data science. We implement solutions and provide advice and coaching for in-house tech teams. We help organizations turn data into higher-level information.
We also have the largest ML and NLP forum on the web.
What is your technology stack?
We are not dogmatic, and will use whatever tools are the most effective.
Our weapon of choice is Python, backed by NumPy and SciPy. Support for scientific computing in Python is great, and there are some good libraries out there. Most of the algorithms we use are proprietary. If we use existing approaches, we like scikit-learn for standard algorithms, and Vowpal Wabbit for fast training when prototyping. A lot of the time, we expose services through XML-RPC so that they can be called from any language. We do not use NLTK, since most of its models are not accurate enough.
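To illustrate the XML-RPC pattern mentioned above, here is a minimal sketch using Python 3's standard library. It is not MetaOptimize's code: the `tokenize` function is a hypothetical stand-in for a real model, and the helper names `start_server` and `call_tokenize` are assumptions made for the example.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def tokenize(text):
    """Hypothetical stand-in for a real model: split text on whitespace."""
    return text.split()


def start_server(host="127.0.0.1", port=0):
    """Serve `tokenize` over XML-RPC in a background thread and return the server.

    Port 0 asks the operating system for any free port.
    """
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(tokenize, "tokenize")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_tokenize(server, text):
    """Call the remote `tokenize` method the way any XML-RPC client would."""
    host, port = server.server_address
    with ServerProxy(f"http://{host}:{port}/") as proxy:
        return proxy.tokenize(text)
```

Because the wire format is plain XML over HTTP, a client written in another language with an XML-RPC library could call the same `tokenize` method without knowing the service is written in Python.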
When prototyping, fields evolve rapidly, and we use a MongoDB store. A common pattern is to do a processing pass over the rows, and add a new field for the output. For relational data, we use SQLAlchemy with a PostgreSQL backend.
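The "processing pass that adds a new field" pattern can be sketched roughly as follows. This is an illustration under assumptions, not the author's code: `add_field_pass` is a hypothetical helper, and `collection` is assumed to be a PyMongo-style collection exposing `find` and `update_one`.

```python
def add_field_pass(collection, field, compute):
    """Fill in `field` on every document that lacks it.

    `collection` is assumed to behave like a PyMongo collection.
    `compute` maps a document (a dict) to the new field's value.
    Returns the number of documents updated.
    """
    updated = 0
    # Select only documents that do not have the field yet, so the pass
    # can be re-run safely after an interruption. The matches are
    # materialized first so that updates do not interfere with iteration.
    pending = list(collection.find({field: {"$exists": False}}))
    for doc in pending:
        collection.update_one(
            {"_id": doc["_id"]},
            {"$set": {field: compute(doc)}},
        )
        updated += 1
    return updated
```

With PyMongo, a call might look like `add_field_pass(db["pages"], "n_tokens", lambda d: len(d["text"].split()))`, where the `pages` collection and the `text` and `n_tokens` fields are made-up names. Because the store is schemaless, no migration is needed before the new field appears.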
Effective and strategic use of crowd-computation is so crucial to our core work that I am going to include it in our stack. I will talk about our service providers below.
We use Boilerpipe a lot for extracting text from webpages.
We use Ubuntu for run-of-the-mill systems, like EC2 instances which we tear down quickly. We use Gentoo for long-running experiment boxes (dedicated servers), because it’s much easier to get state-of-the-art packages and control compile options.
What software do you use to run your business?
Our experiments are typically memory-bound, so we rent >= 16GB dedis from Europe, where they are far less expensive. We have a dedicated experimental server from Kimsufi.ie, the “budget” line of OVH. The dedis from Hetzner are also supposedly very good, and at a comparable price point. 100ms latency is fine for an SSH shell. If we did want to serve APIs to the States with low latency, HoneLive is the provider we’d choose. We have 128 MB VPSes from buyvm lying around that we never use, but we refuse to give them up because they cost $15/yr and are always out-of-stock. For large-scale on-demand computation, we spin up EC2 instances. Our backups are on S3.
For crowd computation, we control Mechanical Turk through CrowdFlower. We use human annotation for tasks that are too small to justify building a model, or when very high accuracy is needed to supplement an automatic method, or when creating training data to improve models.
We host public code on GitHub. We used to host private repos on GitHub, but now we host them on a private server (I had forgotten that we do not need our paid GitHub plan anymore, and just downgraded to the free GitHub plan). When collaborating with outsiders, we prefer to use BitBucket.
We use Google Apps for email. For sending transactional email from our forum, we use fastmail.fm’s SMTP server, but are considering migrating to SendGrid. Boomerang for Gmail is AWESOME. I can’t recommend it enough. We use it when we send an important email that needs a reply, to notify us if no reply has been sent. We would pay double for it. We do not have a mailing list, but should. We will probably use MailChimp.
We use RightSignature for executing contracts. The interface is very simple, and the signatures are great. If clients are puzzled and ask for the contract the old-fashioned way, we use UnityFax. The service gives us our own SF-area fax number, and we send and receive faxes through email. It is so damn cheap ($1.99/mo) and works fine. You have probably never heard of them, because they suck at branding and SEO.
For our phone, we use GrassHopper. Our number is 855-ALL-DATA, because we don’t just work on big data. Sometimes our clients have small data. We care about all data. We only discriminate between data when modeling conditional distributions.
We use Google Docs a lot for sharing and collaboration, and evolving documentation that clients might want to edit. We bill for projects, not time, so we don’t track billable hours. We use Indinero to trap all financial activity, which makes it easier to determine deductibles during tax season. For IM, we use gchat, but are migrating to Pidgin with the OTR plugin.
What business software do you most wish existed?
We don’t tweet, because a lot of the Twitter stream is noise, so it takes too long to find the tweets that we care about. However, we would like to use Twitter to increase our brand recognition, engage potential collaborators and clients, demonstrate expertise, and establish our company as a thought-leader. We would pay for something that filtered the Twitter stream so that we saw only relevant tweets.
We would like more sophisticated tools for doing crowd-programming. CrowdFlower is only good for one-shot tasks, not complicated pipelines. The TurKit library is promising, but I’d prefer something with more features, like CrowdFlower has, so that we would not have to build everything ourselves. CrowdControl is a new entrant into the space which appears to solve the problem correctly, but they are an enterprise solution and charge yearly, so it is not cost-effective for engagements that typically last several months.