The key to getting people to at least appreciate logical data models is simply to build them as part of whatever modeling effort you are working on. Don’t say “stop”. Just model on. Demonstrate, rather than tell, your teams where the business requirements are written down, where they live. Then demonstrate how that leads to beautiful physical models as well.
This is a SQL-based introduction to the data and analysis behind the Wall Street Journal’s Pulitzer-winning “Medicare Unmasked” investigative project. It also doubles as a helpful guide if you’re attempting the midterm based on the WSJ’s Medicare investigation.
To follow along in this walkthrough, you can download my SQLite database here:
medicare_providers_2012.sqlite.zip Be warned: it is nearly 700MB zipped and expands to more than 2 gigabytes.
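Once you’ve unzipped the database, the queries in the walkthrough are mostly aggregate-and-sort over provider rows. Here’s a minimal sketch of that query shape using Python’s built-in sqlite3 module against a tiny in-memory stand-in — the table and column names below are illustrative guesses, not the actual schema; inspect the downloaded file with `.schema` first:

```python
import sqlite3

# In-memory stand-in for the real medicare_providers_2012 database.
# Table/column names here are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE providers (
        npi TEXT,
        provider_type TEXT,
        total_payment REAL
    )
""")
conn.executemany(
    "INSERT INTO providers VALUES (?, ?, ?)",
    [("1001", "Cardiology", 250000.0),
     ("1002", "Cardiology", 90000.0),
     ("1003", "Dermatology", 120000.0)],
)

# The bread-and-butter query shape of the walkthrough: aggregate
# payments by specialty and sort the biggest earners to the top.
rows = conn.execute("""
    SELECT provider_type, SUM(total_payment) AS total
    FROM providers
    GROUP BY provider_type
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('Cardiology', 340000.0), ('Dermatology', 120000.0)]
```

Swap the `connect(":memory:")` call for the path to the unzipped file and the same pattern applies to the real data.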
Accuracy of 90 percent with 80 percent consistency sounds good, but the scores are “actually very poor, since they are for an exceedingly easy case,” Amaral said in an announcement from Northwestern about the study.
Applied to messy, inconsistently scrubbed data from many sources in many formats – the base of data for which big data is often praised for its ability to manage – the results would be far less accurate and far less reproducible, according to the paper.
Here’s an interesting explanation of how LDA, Latent Dirichlet Allocation, works. From: What is a good explanation of Latent Dirichlet Allocation?
From a 3,000-foot level, as I understand the explanation, LDA seems like a mechanism for scoring words in order to categorize sets of words, such as paragraphs or entire papers. It’s an interesting exercise, but a human must model the data first. Any time a program has to estimate or guess like this there will be error; the only question is how much error is acceptable before the results of this kind of analysis are worth using at all.
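To make the “mechanism to score words” intuition concrete, here’s a toy sketch of the generative story LDA assumes — each document is a mixture of topics, each topic a bag of words. This is not LDA inference (the hard, error-prone part); the topic names and word lists are made up for the example:

```python
import random

random.seed(0)

# Hypothetical topics: each is just a distribution over words
# (uniform here, for simplicity).
topics = {
    "sports": ["game", "team", "score", "season"],
    "finance": ["market", "stock", "price", "trade"],
}

def generate_document(topic_mixture, n_words=8):
    """Sample each word by first picking a topic, then a word from it."""
    doc = []
    names = list(topic_mixture)
    weights = list(topic_mixture.values())
    for _ in range(n_words):
        topic = random.choices(names, weights=weights)[0]
        doc.append(random.choice(topics[topic]))
    return doc

# A document that is 75% sports, 25% finance.
doc = generate_document({"sports": 0.75, "finance": 0.25})
print(doc)
```

Inference runs this story in reverse: given only the documents, estimate the topics and mixtures — which is exactly where the estimating and guessing (and therefore the error) comes in.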
A good predictive model requires a stable set of inputs with a predictable range of values that won’t drift away from the training set. And the response variable needs to remain of organizational interest.
If you want to move at the speed of “now, light, big data, thought, stuff,” pick your big data analytics battles. If your business is currently too chaotic to support a complex model, don’t build one. Focus on providing solid, simple analysis until an opportunity arises that is revenue-important enough and stable enough to merit the type of investment a full-fledged data science modeling effort requires.
I’d like to show an example of anthropomorphism gone wrong. It was given to me as a classic justification of why so called “Object Oriented Programming” is better than procedural programming. You may have learned it in your first lesson about OOP.
(Note: I’m not disparaging OOP here, just the example. For genuine OOP bashing, see here.)
From slashdot comments that I found funny:
Lets say you’re a traveling auto salesman, and you would like to sell your cars to different stores around the state. You could either drive each car, one at a time, to each assigned destination and hitchhike back to your starting point (always with a towel). Or you could come up with an algorithm for taking all the cars, putting them into a truck, and finding the shortest path that visits each auto store, saving gas and giving you the street credibility to comment on the appropriateness of OOP vs procedural languages. Then, after having spent a more fulfilling life than most people by being so efficient, you can watch as people invoke your name, and come up with a poor analogy which doesn’t really explain OOP vs procedural languages that shows up on Slashdot.
Why was the above funny? See the Wikipedia article on Dijkstra’s algorithm, which the first quoted editorial used as a source:
Dijkstra’s algorithm, conceived by computer scientist Edsger Dijkstra in 1956 and published in 1959, is a graph search algorithm that solves the single-source shortest path problem for a graph with non-negative edge path costs, producing a shortest path tree. This algorithm is often used in routing and as a subroutine in other graph algorithms.
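For reference, here’s a minimal sketch of the algorithm itself using a priority queue — the standard textbook formulation, with a small made-up graph:

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths for non-negative edge weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Adjacency list: node -> [(neighbor, cost), ...]
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 6)],
    "C": [("D", 3)],
}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 6}
```

Note that the source node matters, the edge costs matter, and nothing about car sales does — hence the joke.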
The switch from relational hadn’t been too hard because Riak is a key-value store, which made modeling relatively easy. Key-value stores are relatively simple database management systems that store just pairs of keys and values.
McCaul reckoned, too, that the data migration had been made possible because the structure of patient records lent itself to Riak’s key-value model.
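The point about patient records fitting the key-value model is easy to illustrate. A key-value store exposes essentially put/get/delete by key, and values are opaque blobs to the store — here a plain dict stands in for Riak, and the record layout is a made-up example, not McCaul’s actual schema:

```python
import json

store = {}  # a dict standing in for the key-value store

def put(key, value):
    store[key] = json.dumps(value)  # values are opaque blobs to the store

def get(key):
    return json.loads(store[key])

# A whole patient record maps naturally to one key and one value:
# fetch by key, no joins required.
put("patient:12345", {"name": "Jane Doe", "allergies": ["penicillin"]})
record = get("patient:12345")
print(record["allergies"])  # ['penicillin']
```

The flip side, of course, is that anything relational — “all patients with this allergy” — now requires secondary indexes or scans, which is why the fit of the access pattern matters so much.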
Baseball data, over 95% of which has been created over the last five years, will continue to mount—leading MLB decision-makers to invest in more powerful analytics tools. While there are plenty of business intelligence and database options, teams are now looking to supercomputing—or at least, the spawn of HPC—to help them gain the competitive edge.
Please. The problem with current baseball analytics isn’t the deluge of data, it’s the deluge of crackpot theories that add more and more irrelevant variables to the mix. Most baseball analytics misuses mathematics and is created by people who are simply selling a website.
Speaking of selling a website: is this a good place to introduce the sister site to bucktownbell.com?
All data in the above data model was crunched using Perl, awk, and bash on a standard PC. Baseball is not so complicated that it requires a supercomputer to crunch historical or current-season data. More from the article…
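To make the “standard PC” point concrete: a season of batting totals is tiny by any modern measure. A toy sketch, with made-up numbers, of the kind of crunching involved — the same logic is a one-liner in awk:

```python
import csv, io

# Made-up season totals; real historical datasets (e.g. the Lahman
# database) are only a few hundred megabytes -- no supercomputer needed.
data = """player,hits,at_bats
Smith,172,540
Jones,145,510
Rivera,160,495
"""

averages = {
    row["player"]: round(int(row["hits"]) / int(row["at_bats"]), 3)
    for row in csv.DictReader(io.StringIO(data))
}
for player, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(player, f"{avg:.3f}")
```

Swap `io.StringIO(data)` for an open file handle and this scales to a century of box scores without breaking a sweat.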
He explained that what teams, just like governments and drug development researchers, are looking for is a “hypothesis machine” that will allow them to integrate multiple, deep data wells and pose several questions against the same data.
A recent post on Reactive Programming triggered discussions about what is and isn’t considered Reactive Logic. In fact, many have already discovered that Reactive Programming can help improve quality and transparency, reduce programming time and decrease maintenance. But for others, it raises questions like:
- How does Reactive differ from conventional event-oriented programming?
- Isn’t Reactive just another form of triggers?
- What kind of an improvement in coding can you expect using Reactive and why?
So to help clear things up, here is a real-life example that will show the power and long-term advantages Reactive offers. In this scenario, we’ll compare what it takes to implement business logic using Reactive Programming versus two different conventional procedural programming models: Java with Hibernate and MySQL triggers.
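Before the comparison, here’s the flavor of the reactive idea in miniature — declare what a value *is*, rather than writing code for every event that might change it. This is a spreadsheet-style sketch of the concept, not any particular framework from the post:

```python
# Reactive rule: an order's total IS the sum of its lines.
# Procedural code (or a trigger) would instead have to update a stored
# total on every insert, update, and delete path -- and keep them all
# consistent by hand.

class Order:
    def __init__(self):
        self.lines = []  # list of (unit_price, qty)

    @property
    def total(self):
        # Declarative derivation: recomputed whenever it is read,
        # so it can never drift out of sync with the lines.
        return sum(price * qty for price, qty in self.lines)

order = Order()
order.lines.append((10.0, 2))
order.lines.append((5.0, 3))
print(order.total)  # 35.0
```

Real reactive engines are smarter than this naive recompute-on-read (they propagate changes incrementally), but the programming-model difference — one declared rule versus N hand-written update paths — is the part the comparison below turns on.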
“We were surprised to see that it was actually fairly difficult to use HealthCare.gov to find and understand our options,” he told CNN. “Given that the data was publicly available, we thought that it made a lot of sense to take the data that was on there and just make it easy to search through and view available plans.”
The result is a bare-bones site that lets users enter their zip code, plus details about their family and income, to find suggested plans in their area.
The site is here at www.thehealthsherpa.com and it seems pretty damn good!
Scientists like DeDeo and Vespignani make good use of this piecemeal approach to big data analysis, but Yale University mathematician Ronald Coifman says that what is really needed is the big data equivalent of a Newtonian revolution, on par with the 17th century invention of calculus, which he believes is already underway. It is not sufficient, he argues, to simply collect and store massive amounts of data; they must be intelligently curated, and that requires a global framework.
Among the most notable insights Euler gleaned from the puzzle was that the exact positions of the bridges were irrelevant to the solution; all that mattered was the number of bridges and how they were connected. Mathematicians now recognize in this the seeds of the modern field of topology.
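Euler’s insight is small enough to check by hand — or in a few lines. A walk crossing every bridge exactly once exists only if zero or two land masses touch an odd number of bridges; Königsberg’s seven bridges give all four land masses odd degree, so no such walk exists. A quick sketch (land masses labeled A–D for illustration):

```python
from collections import Counter

# Königsberg: four land masses joined by seven bridges (edges).
# Only the connections matter -- exactly Euler's point.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

# An Eulerian walk exists only if 0 or 2 vertices have odd degree.
odd = sorted(n for n, d in degree.items() if d % 2 == 1)
print(odd)  # ['A', 'B', 'C', 'D'] -- four odd vertices, so no walk
```

Strip away the geography and only the graph remains — which is exactly the move that topology (and, for that matter, big-data curation) depends on.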